Message ID | 20190325125142.11628-1-zhengxiang9@huawei.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] hw/arm/virt: use variable size of flash device to save memory
On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>
> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
> using persistent UEFI variables on QEMU. Actually we only use a very
> small part of the memory while the rest significant large part of
> memory is wasted.
>
> This patch creates and maps a variable size of flash device instead of
> a mandatory 64MB one to save memory.
>
> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
> ---
>
> This patch might be insufficient since it also needs to modify the flash size
> in ACPI and DTB.
>
> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
> in size when using -pflash.

I don't think we should do this. The board should in general
create the same hardware visible to the guest, not change
it based on subtle things like the size of the image files.

The reason why the flash images must be 64MB in size
when using -pflash is that they are the backing store
for a writable device. Suppose you have 1MB of data in your
backing image that you pass to QEMU and then the guest writes
to the last block of the flash device. The new data
written by the guest to the end of the device has to be
stored somewhere, so the file has to be large enough
to cover the whole of the flash area.

thanks
-- PMM
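[Editor's note: the constraint Peter describes is why users typically pad a small firmware image up to the full 64MB flash bank before passing it with -pflash. The sketch below is an illustration only, not part of the thread; the file names are hypothetical. It extends the image with a sparse truncate, so every guest write to the flash has backing, while the padding need not cost disk blocks on filesystems that support sparse files.]

```python
import os

FLASH_SIZE = 64 * 1024 * 1024  # size of one VIRT_FLASH bank on the "virt" board


def pad_firmware(src: str, dst: str, flash_size: int = FLASH_SIZE) -> None:
    """Copy a small firmware image and extend it to the full flash size.

    os.truncate() only adjusts the file length; on a filesystem with
    sparse-file support the ~62MB of padding occupies no disk blocks,
    but QEMU still sees a file large enough to back every guest write.
    """
    data = open(src, "rb").read()
    if len(data) > flash_size:
        raise ValueError("firmware image larger than the flash bank")
    with open(dst, "wb") as f:
        f.write(data)
    os.truncate(dst, flash_size)


# Hypothetical paths -- substitute the real QEMU_EFI.fd location:
# pad_firmware("QEMU_EFI.fd", "flash0.img")
```

Note this only addresses on-disk size; the in-memory footprint QEMU allocates for the device is the separate problem discussed below.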
Hi Peter,

Thanks for your reply!

On 2019/3/25 21:11, Peter Maydell wrote:
> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>
>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>> using persistent UEFI variables on QEMU. Actually we only use a very
>> small part of the memory while the rest significant large part of
>> memory is wasted.
>>
>> This patch creates and maps a variable size of flash device instead of
>> a mandatory 64MB one to save memory.
>>
>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>> ---
>>
>> This patch might be insufficient since it also needs to modify the flash size
>> in ACPI and DTB.
>>
>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>> in size when using -pflash.
>
> I don't think we should do this. The board should in general
> create the same hardware visible to the guest, not change
> it based on subtle things like the size of the image files.
>
> The reason why the flash images must be 64MB in size
> when using -pflash is that they are the backing store
> for a writable device. Suppose you have 1MB of data in your
> backing image that you pass to QEMU and then the guest writes
> to the last block of the flash device. The new data
> written by the guest to the end of the device has to be
> stored somewhere, so the file has to be large enough
> to cover the whole of the flash area.
>

Is there any way to configure or limit the size that is visible to both
the guest and QEMU?

The original QEMU_EFI.fd is only 2MB, but we need to pad it to 64MB with
62MB of unused data. This consumes a large amount of memory when running
multiple VMs simultaneously.
On 03/25/19 14:11, Peter Maydell wrote:
> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>
>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>> using persistent UEFI variables on QEMU. Actually we only use a very
>> small part of the memory while the rest significant large part of
>> memory is wasted.
>>
>> This patch creates and maps a variable size of flash device instead of
>> a mandatory 64MB one to save memory.
>>
>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>> ---
>>
>> This patch might be insufficient since it also needs to modify the flash size
>> in ACPI and DTB.
>>
>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>> in size when using -pflash.
>
> I don't think we should do this. The board should in general
> create the same hardware visible to the guest, not change
> it based on subtle things like the size of the image files.

I consider this question/behavior board-specific.

For the "virt" board, I agree with you (Peter), simply because that's
how the "virt" board has always worked.

(See also the tangentially related series

  [Qemu-devel] [PATCH v8 0/2] pflash: Require backend size to match device, improve errors

at

  http://mid.mail-archive.com/20190319163551.32499-1-armbru@redhat.com .

This series still lets each board decide the question for itself, but
it improves error detection and reporting.)

Thanks,
Laszlo

> The reason why the flash images must be 64MB in size
> when using -pflash is that they are the backing store
> for a writable device. Suppose you have 1MB of data in your
> backing image that you pass to QEMU and then the guest writes
> to the last block of the flash device. The new data
> written by the guest to the end of the device has to be
> stored somewhere, so the file has to be large enough
> to cover the whole of the flash area.
>
> thanks
> -- PMM
>
Zheng Xiang <zhengxiang9@huawei.com> writes:

> Hi Peter,
>
> Thanks for your reply!
>
> On 2019/3/25 21:11, Peter Maydell wrote:
>> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>>
>>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>>> using persistent UEFI variables on QEMU. Actually we only use a very
>>> small part of the memory while the rest significant large part of
>>> memory is wasted.
>>>
>>> This patch creates and maps a variable size of flash device instead of
>>> a mandatory 64MB one to save memory.
>>>
>>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>>> ---
>>>
>>> This patch might be insufficient since it also needs to modify the flash size
>>> in ACPI and DTB.
>>>
>>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>>> in size when using -pflash.
>>
>> I don't think we should do this. The board should in general
>> create the same hardware visible to the guest, not change
>> it based on subtle things like the size of the image files.

Concur.

>> The reason why the flash images must be 64MB in size
>> when using -pflash is that they are the backing store
>> for a writable device. Suppose you have 1MB of data in your
>> backing image that you pass to QEMU and then the guest writes
>> to the last block of the flash device. The new data
>> written by the guest to the end of the device has to be
>> stored somewhere, so the file has to be large enough
>> to cover the whole of the flash area.
>>
>
> Is there any way to support config or limit the size that both
> guest and QEMU are visible?
>
> The original QEMU_EFI.fd has only 2M, but we need to stuff it
> to 64M with 62M unused data. It will consume a large amount of
> memory when running multiple VM simultaneously.

Here's a number of ideas.

The first one is of course making the flash memory size configurable,
to eliminate the "unused" part. Our PC machines use the backing image
sizes as configuration. I consider that a bad idea that should not be
allowed to spread to other machines. Peter seems to agree.

The accepted way to create minor variations of a machine type is
machine properties. Whether using them to vary flash chip size would
be acceptable is for the board maintainer to decide.

Now let's think about mapping these flash images more efficiently.

We could avoid backing their "unused" part with pages. Unless the
"unused" part is read-only, this leaves you at your guests' mercy: they
can make the host allocate pages by writing to them.

We could share the pflash memory among VMs running the same firmware.
If writes are permitted, we'd have to unshare on write (COW). Again,
you're at your guests' mercy unless read-only: they can make the host
unshare pages by writing to them.

I figure the "share" idea would be easier to implement[*].

Both ideas require either trusted guests or read-only flash. EFI
generally wants some read/write flash for its var store. Can we make
everything else read-only?

We can improve the device models to let us set up a suitable part of
the pflash memory read-only. This is how real hardware works. Our PC
machines currently approximate this with *two* flash chips, one
read-only, one read/write, which I consider a mistake that should not
be allowed to spread to other machines.

Prior discussions

    Message-ID: <87imxfssq1.fsf@dusky.pond.sub.org>
    https://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg05056.html

and

    Message-ID: <87y378n5iy.fsf@dusky.pond.sub.org>
    https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html


[*] If you run KSM (kernel same-page merging), the kernel should set up
the sharing for you automatically. But not everybody wants to run KSM.
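[Editor's note: the "unshare on write" behaviour Markus describes is exactly what a plain MAP_PRIVATE file mapping gives you. The following is an illustrative, POSIX-only Python sketch (not from the thread): all processes mapping the same firmware file privately share its page-cache pages until one of them writes, at which point the kernel copies only the touched page, and the file itself is never modified.]

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# A stand-in for a firmware image shared by several VMs.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * (4 * PAGE))
os.close(fd)

with open(path, "r+b") as f:
    # MAP_PRIVATE: pages are shared (via the page cache) until written;
    # the first write to a page triggers copy-on-write into a private
    # page, and the underlying file stays untouched.
    m = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    m[0:4] = b"EFI!"                  # a guest-style write; COW kicks in
    print(m[0:4])                     # the mapping sees the new data
    print(open(path, "rb").read(4))   # the file still holds b'AAAA'
    m.close()

os.unlink(path)
```

This is also why the guests-can-unshare caveat applies: every page a guest writes becomes a private copy the host must back with real memory.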
On 03/26/19 07:17, Markus Armbruster wrote:
> Zheng Xiang <zhengxiang9@huawei.com> writes:
>
>> Hi Peter,
>>
>> Thanks for your reply!
>>
>> On 2019/3/25 21:11, Peter Maydell wrote:
>>> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>>>
>>>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>>>> using persistent UEFI variables on QEMU. Actually we only use a very
>>>> small part of the memory while the rest significant large part of
>>>> memory is wasted.
>>>>
>>>> This patch creates and maps a variable size of flash device instead of
>>>> a mandatory 64MB one to save memory.
>>>>
>>>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>>>> ---
>>>>
>>>> This patch might be insufficient since it also needs to modify the flash size
>>>> in ACPI and DTB.
>>>>
>>>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>>>> in size when using -pflash.
>>>
>>> I don't think we should do this. The board should in general
>>> create the same hardware visible to the guest, not change
>>> it based on subtle things like the size of the image files.
>
> Concur.
>
>>> The reason why the flash images must be 64MB in size
>>> when using -pflash is that they are the backing store
>>> for a writable device. Suppose you have 1MB of data in your
>>> backing image that you pass to QEMU and then the guest writes
>>> to the last block of the flash device. The new data
>>> written by the guest to the end of the device has to be
>>> stored somewhere, so the file has to be large enough
>>> to cover the whole of the flash area.
>>>
>>
>> Is there any way to support config or limit the size that both
>> guest and QEMU are visible?
>>
>> The original QEMU_EFI.fd has only 2M, but we need to stuff it
>> to 64M with 62M unused data. It will consume a large amount of
>> memory when running multiple VM simultaneously.
>
> Here's a number of ideas.
>
> The first one is of course making the flash memory size configurable, to
> eliminate the "unused" part. Our PC machines use the backing image
> sizes as configuration. I consider that a bad idea that should not be
> allowed to spread to other machines. Peter seems to agree.

For the full picture, we have to realize that flash size is a trade-off
for virtual machines too, not just for physical machines. UEFI firmware
tends to grow like an OS -- whether that's good or bad belongs on
another page :) --, and so you tend to need more and more space for it,
even after Link Time Optimization, and image compression, over time.

So you are left with a choice.

* You can let board logic size the flash dynamically at startup (and
  then call it a "bad idea" :) ).

* Alternatively, you can hardcode the pflash size in the board setup,
  perhaps dependent on machine type version, and/or dependent on some
  properties. Even assuming that this diversity doesn't break migration
  at once, you'll create a dependency between firmware releases (sizes)
  and machine types. 'For the next release of ArmVirtQemu/edk2, you'll
  need "virt-4.1" or later'. In addition, in most cases, the firmware,
  when it runs from flash, cannot dynamically adapt itself to random
  flash sizes and/or base addresses, so not only will new (larger)
  firmware not fit on old machine types, but old (small) firmware may
  also not feel "at home" on new machine types. (Note: this is not a
  theoretical limitation, but it is a *very* practical one.) That's a
  kind of mapping that is taken for "obvious" in the physical world
  (you get the board, your firmware comes with it, that's why it's
  called *firm*), but it used to be frowned upon in the virtual world.

* Or else, you pad the pflash chips as broadly as you dare, in order to
  never run into the above mess -- and then someone complains "it
  consumes too many resources". :)

For some perspective, OVMF started out with a 1MB cumulative size
(executable + varstore). We ran out of 1MB with executable code, so we
introduced the 2MB build (with unchanged varstore size). Then, under
Microsoft's SVVP checks, the varstore size proved insufficient, and we
introduced the 4MB build, with enough space in both the executable part
and the varstore part for them to last for "the foreseeable future".
And, the dynamic logic in the PC board code allows up to an 8MB
cumulative size (and that's also not a machine type property, but a
cold hard constant).

With the dynamic sizing in QEMU (which, IIRC, I had originally
introduced still in the 1MB times, due to the split between the
executable and varstore parts), both the 1MB->2MB switch, and the
2MB->4MB switch in the firmware caused zero pain in QEMU. And right
now, 4MB looks like a "sweet spot", with some elbow room left.

The "virt" board took a different approach, with different benefits (no
dynamic sizing at startup, hence identical guest view) and different
problems (it's more hungry for resources). "Pick your poison."

A hopefully more constructive comment below:

> The accepted way to create minor variations of a machine type is machine
> properties. Whether using them to vary flash chip size would be
> acceptable is for the board maintainer to decide.
>
> Now let's think about mapping these flash images more efficiently.
>
> We could avoid backing their "unused" part with pages. Unless the
> "unused" part is read-only, this leaves you at your guests' mercy: they
> can make the host allocate pages by writing to them.

First, the flash backends' on-disk demand shouldn't be catastrophic,
even if we use "raw" -- the executable should be shared between all VMs
running on the host, and the varstore files (which must be private to
VMs) could be created as sparse files. For example, libvirt could be
improved to create sparse copies of the varstore *template* files.

Second, regarding physical RAM consumption: disable memory overcommit,
and set a large swap. The unused parts of the large pflash chips should
be swapped out at some point (after their initial population from
disk), and should never be swapped back in again.

Thanks
Laszlo

>
> We could share the pflash memory among VMs running the same firmware.
> If writes are permitted, we'd have to unshare on write (COW). Again,
> you're at your guests' mercy unless read-only: they can make the host
> unshare pages by writing to them.
>
> I figure the "share" idea would be easier to implement[*].
>
> Both ideas require either trusted guests or read-only flash. EFI
> generally wants some read/write flash for its var store. Can we make
> everything else read-only?
>
> We can improve the device models to let us set up a suitable part of the
> pflash memory read-only. This is how real hardware works. Our PC
> machines currently approximate this with *two* flash chips, one
> read-only, one read/write, which I consider a mistake that should not be
> allowed to spread to other machines.
>
> Prior discussions
>
> Message-ID: <87imxfssq1.fsf@dusky.pond.sub.org>
> https://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg05056.html
>
> and
>
> Message-ID: <87y378n5iy.fsf@dusky.pond.sub.org>
> https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
>
>
>
> [*] If you run KSM (kernel same-page merging), the kernel should set up
> the sharing for you automatically. But not everybody wants to run KSM.
>
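[Editor's note: Laszlo's suggestion that libvirt create sparse copies of varstore template files can be sketched as follows. This is an illustrative Python sketch, not libvirt code: blocks that are entirely zero are seeked over instead of written, leaving holes, and a final truncate preserves the apparent size QEMU expects.]

```python
import os


def sparse_copy(template: str, dst: str, block: int = 4096) -> None:
    """Copy a varstore template, leaving holes where it is all zeros.

    seek() past a zero block instead of writing it, so on a filesystem
    with sparse-file support those blocks allocate no disk space. The
    trailing truncate() restores the full apparent size even when the
    file ends in a hole.
    """
    size = os.path.getsize(template)
    with open(template, "rb") as src, open(dst, "wb") as out:
        while True:
            chunk = src.read(block)
            if not chunk:
                break
            if chunk.count(0) == len(chunk):     # all zeros: leave a hole
                out.seek(len(chunk), os.SEEK_CUR)
            else:
                out.write(chunk)
        out.truncate(size)
```

The copy compares equal to the template byte for byte (holes read back as zeros), while `st_blocks` for the copy can be far smaller than the apparent size on sparse-capable filesystems.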
Laszlo Ersek <lersek@redhat.com> writes:

> On 03/26/19 07:17, Markus Armbruster wrote:
>> Zheng Xiang <zhengxiang9@huawei.com> writes:
>>
>>> Hi Peter,
>>>
>>> Thanks for your reply!
>>>
>>> On 2019/3/25 21:11, Peter Maydell wrote:
>>>> On Mon, 25 Mar 2019 at 12:53, Xiang Zheng <zhengxiang9@huawei.com> wrote:
>>>>>
>>>>> Currently we fill the VIRT_FLASH space with two 64MB NOR images when
>>>>> using persistent UEFI variables on QEMU. Actually we only use a very
>>>>> small part of the memory while the rest significant large part of
>>>>> memory is wasted.
>>>>>
>>>>> This patch creates and maps a variable size of flash device instead of
>>>>> a mandatory 64MB one to save memory.
>>>>>
>>>>> Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
>>>>> ---
>>>>>
>>>>> This patch might be insufficient since it also needs to modify the flash size
>>>>> in ACPI and DTB.
>>>>>
>>>>> BTW, I don't understand why it requires the two NOR images to be exactly 64MB
>>>>> in size when using -pflash.
>>>>
>>>> I don't think we should do this. The board should in general
>>>> create the same hardware visible to the guest, not change
>>>> it based on subtle things like the size of the image files.
>>
>> Concur.
>>
>>>> The reason why the flash images must be 64MB in size
>>>> when using -pflash is that they are the backing store
>>>> for a writable device. Suppose you have 1MB of data in your
>>>> backing image that you pass to QEMU and then the guest writes
>>>> to the last block of the flash device. The new data
>>>> written by the guest to the end of the device has to be
>>>> stored somewhere, so the file has to be large enough
>>>> to cover the whole of the flash area.
>>>>
>>>
>>> Is there any way to support config or limit the size that both
>>> guest and QEMU are visible?
>>>
>>> The original QEMU_EFI.fd has only 2M, but we need to stuff it
>>> to 64M with 62M unused data. It will consume a large amount of
>>> memory when running multiple VM simultaneously.
>>
>> Here's a number of ideas.
>>
>> The first one is of course making the flash memory size configurable, to
>> eliminate the "unused" part. Our PC machines use the backing image
>> sizes as configuration. I consider that a bad idea that should not be
>> allowed to spread to other machines. Peter seems to agree.
>
> For the full picture, we have to realize that flash size is a trade-off
> for virtual machines too, not just for physical machines. UEFI firmware
> tends to grow like an OS -- whether that's good or bad belongs on
> another page :) --, and so you tend to need more and more space for it,
> even after Link Time Optimization, and image compression, over time.

Same for physical and virtual machines.

> So you are left with a choice.
>
> * You can let board logic size the flash dynamically at startup (and
>   then call it a "bad idea" :) ).
>
> * Alternatively, you can hardcode the pflash size in the board setup,
>   perhaps depdent on machine type version, and/or dependent on some
>   properties.

The difference between getting flash memory size from the backend
vs. getting it from a machine type or property is implicit
vs. explicit configuration.

Implicit vs. explicit can have ramifications beyond the user
interface. For instance, if we ever get around to transferring device
configuration in the migration stream, the machine property would
surely be part of that, but the size of the backend won't.

> Even assuming that this diversity doesn't break migration at once,

I doubt explicit could break anything that implicit couldn't :)

> you'll create a dependency between firmware releases (sizes) and
> machine types. 'For the next release of ArmVirtQemu/edk2, you'll need
> "virt-4.1" or later'.

Yes if you tie the size to the machine type. No if you get it from a
machine property.

> In addition, in most cases, the firmware, when it runs from flash,
> cannot dynamically adapt itself to random flash sizes and/or base
> addresses, so not only will new (larger) firmware not fit on old machine
> types, but old (small) firmware may also not feel "at home" on new
> machine types. (Note: this is not a theoretical limitation, but it is a
> *very* practical one.)

The exact same problem exists for physical machines. You can revise
your firmware only within limits set by the board.

I don't mean to say this problem isn't worth avoiding for virtual
machines. Only that it is neither new nor intractable.

> That's a kind of mapping that is taken for "obvious" in the physical
> world (you get the board, your firmware comes with it, that's why it's
> called *firm*), but it used to be frowned upon in the virtual world.

I'm willing to give developers of virtual firmware more flexibility
than they get in the physical world. I just happen to dislike
"implicit" and "any multiple of 4KiB up to a limit" (because physical
flash chips with sizes like 64140KiB do not exist, and virtual ones
should not).

> * Or else, you pad the pflash chips as broadly as you dare, in order to
>   never run into the above mess -- and then someone complains "it
>   consumes too many resources". :)

I think that complaint would be exactly as valid for unpadded firmware!
Once you get to the point where you care whether each guest's firmware
eats up 2MiB or 64MiB, what you *really* want is probably 128KiB per
guest plus $whatever shared among all guests. You probably won't care
all that much whether $whatever is 2MiB - 128KiB or 64MiB - 128KiB.

> For some perspective, OVMF started out with a 1MB cumulative size
> (executable + varstore). We ran out of 1MB with executable code, so we
> introduced the 2MB build (with unchanged varstore size). Then, under
> Microsoft's SVVP checks, the varstore size proved insufficient, and we
> introduced the 4MB buid, with enough space in both the executable part
> and the varstore part for them to last for "the foreseeable future".
> And, the dynamic logic in the PC board code allows up to a 8MB
> cumulative size (and that's also not a machine type property, but a cold
> hard constant).
>
> With the dynamic sizing in QEMU (which, IIRC, I had originally
> introduced still in the 1MB times, due to the split between the
> executable and varstore parts), both the 1MB->2MB switch, and the
> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
> 4MB looks like a "sweet spot", with some elbow room left.

Explicit configuration would've been exactly as painless. Even with
pflash sizes restricted to powers of two.

> The "virt" board took a different approach, with different benefits (no
> dynamic sizing at startup, hence identical guest view) and different
> problems (it's more hungry for resources). "Pick your poison."

Here's the one I'd like to be able to pick: have a single pflash chip
with a read-only part and a read/write part. Map the read-only part so
it is shared among guests. Map the read/write part normally. Default
the sizes to something that makes sense now, with reasonable elbow
room. Make sure there's a way to grow.

> A hopefully more constructive comment below:
>
>> The accepted way to create minor variations of a machine type is machine
>> properties. Whether using them to vary flash chip size would be
>> acceptable is for the board maintainer to decide.
>>
>> Now let's think about mapping these flash images more efficiently.
>>
>> We could avoid backing their "unused" part with pages. Unless the
>> "unused" part is read-only, this leaves you at your guests' mercy: they
>> can make the host allocate pages by writing to them.
>
> First, the flash backends' on-disk demand shouldn't be catastrophic,
> even if we use "raw" -- the executable should be shared between all VMs
> running on the host, and the varstore files (which must be private to
> VMs) could be created as sparse files. For example, libvirt could be
> improved to create sparse copies of the varstore *template* files.

Same story as in memory, really: share the read-only part (i.e. the
executable) among the guests, keep the read/write part small.

> Second, regarding physical RAM consumption: disable memory overcommit,
> and set a large swap. The unused parts of the large pflash chips should
> be swapped out at some point (after their initial population from disk),
> and should never be swapped back in again.

Just providing swap should do the trick, shouldn't it?

> Thanks
> Laszlo
>
>>
>> We could share the pflash memory among VMs running the same firmware.
>> If writes are permitted, we'd have to unshare on write (COW). Again,
>> you're at your guests' mercy unless read-only: they can make the host
>> unshare pages by writing to them.
>>
>> I figure the "share" idea would be easier to implement[*].
>>
>> Both ideas require either trusted guests or read-only flash. EFI
>> generally wants some read/write flash for its var store. Can we make
>> everything else read-only?
>>
>> We can improve the device models to let us set up a suitable part of the
>> pflash memory read-only. This is how real hardware works. Our PC
>> machines currently approximate this with *two* flash chips, one
>> read-only, one read/write, which I consider a mistake that should not be
>> allowed to spread to other machines.
>>
>> Prior discussions
>>
>> Message-ID: <87imxfssq1.fsf@dusky.pond.sub.org>
>> https://lists.nongnu.org/archive/html/qemu-devel/2019-02/msg05056.html
>>
>> and
>>
>> Message-ID: <87y378n5iy.fsf@dusky.pond.sub.org>
>> https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
>>
>>
>>
>> [*] If you run KSM (kernel same-page merging), the kernel should set up
>> the sharing for you automatically. But not everybody wants to run KSM.
>>
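[Editor's note: the single-chip layout Markus prefers -- one flash file with a read-only executable part and a read/write varstore part -- can be mimicked at the mapping level. The sketch below is illustrative Python, not QEMU code: it maps the front of one file read-only (shareable between processes via the page cache) and the page-aligned tail read/write, and shows that writes through the read-only mapping are rejected while varstore writes reach the file.]

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
RO_SIZE = 2 * PAGE   # "executable" part, shareable among guests
RW_SIZE = 2 * PAGE   # "varstore" part, writes go through to the file

fd, path = tempfile.mkstemp()
os.write(fd, b"X" * RO_SIZE + b"\x00" * RW_SIZE)
os.close(fd)

with open(path, "r+b") as f:
    # Read-only mapping of the executable part: the kernel can keep a
    # single copy of these pages for every process mapping the file,
    # and any write attempt through this mapping fails.
    ro = mmap.mmap(f.fileno(), RO_SIZE, access=mmap.ACCESS_READ)
    # Writable shared mapping of the varstore part; the mmap offset
    # must be page-aligned, which is why the split lands on PAGE.
    rw = mmap.mmap(f.fileno(), RW_SIZE, access=mmap.ACCESS_WRITE,
                   offset=RO_SIZE)

    rw[0:4] = b"VARS"            # varstore write: reaches the file
    try:
        ro[0:4] = b"oops"        # firmware part stays immutable
    except TypeError as e:
        print("write to read-only part rejected:", e)

    ro.close()
    rw.close()

with open(path, "rb") as f:
    f.seek(RO_SIZE)
    print(f.read(4))             # the varstore write persisted

os.unlink(path)
```

In QEMU terms this would correspond to device-model support for marking a region of one pflash chip read-only, rather than approximating the split with two chips.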
On 03/26/19 17:39, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:

>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>> introduced still in the 1MB times, due to the split between the
>> executable and varstore parts), both the 1MB->2MB switch, and the
>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>> 4MB looks like a "sweet spot", with some elbow room left.
>
> Explicit configuration would've been exactly as painless. Even with
> pflash sizes restricted to powers of two.

I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
-- in 2013. I'm unsure if machine type properties existed back then,
but even if they did, do you think I knew about them? :)

You are right, of course; it's just that we can't tell the future.

>> Second, regarding physical RAM consumption: disable memory overcommit,
>> and set a large swap. The unused parts of the large pflash chips should
>> be swapped out at some point (after their initial population from disk),
>> and should never be swapped back in again.
>
> Just providing swap should do the trick, shouldn't it?

Yes, it should.

(I just find any given swap setting more trustworthy if memory
overcommit is disabled -- for me, disabling memory overcommit comes
first. I trust "this swap should be large enough" much more if the
kernel enforces allocations and the OOM killer can't surprise me
later. I know we don't check g_malloc/g_new in QEMU, but I still
prefer if those crash QEMU over the OOM killer picking a victim.)

Thanks,
Laszlo
Laszlo Ersek <lersek@redhat.com> writes:

> On 03/26/19 17:39, Markus Armbruster wrote:
>> Laszlo Ersek <lersek@redhat.com> writes:
>
>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>> introduced still in the 1MB times, due to the split between the
>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>> 4MB looks like a "sweet spot", with some elbow room left.
>>
>> Explicit configuration would've been exactly as painless. Even with
>> pflash sizes restricted to powers of two.
>
> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
> -- in 2013. I'm unsure if machine type properties existed back then, but
> even if they did, do you think I knew about them? :)
>
> You are right, of course; it's just that we can't tell the future.

True! All we can do is continue to design as well as we can given the
information, experience and resources we have, and when the inevitable
design mistakes become apparent, limit their impact.

Some of the things we now consider mistakes we just didn't see. Others
we saw (e.g. multiple pflash devices, unlike physical hardware), but
underestimated their impact.

[...]
Hi Laszlo and Markus,

Thanks for your useful suggestions and comments! :)

On 2019/3/27 2:36, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
>
>> On 03/26/19 17:39, Markus Armbruster wrote:
>>> Laszlo Ersek <lersek@redhat.com> writes:
>>
>>>> With the dynamic sizing in QEMU (which, IIRC, I had originally
>>>> introduced still in the 1MB times, due to the split between the
>>>> executable and varstore parts), both the 1MB->2MB switch, and the
>>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now,
>>>> 4MB looks like a "sweet spot", with some elbow room left.
>>>
>>> Explicit configuration would've been exactly as painless. Even with
>>> pflash sizes restricted to powers of two.
>>
>> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b
>> -- in 2013. I'm unsure if machine type properties existed back then, but
>> even if they did, do you think I knew about them? :)
>>
>> You are right, of course; it's just that we can't tell the future.
>
> True! All we can do is continue to design as well as we can given the
> information, experience and resources we have, and when the inevitable
> design mistakes become apparent, limit their impact.
>
> Some of the things we now consider mistakes we just didn't see. Others
> we saw (e.g. multiple pflash devices, unlike physical hardware), but
> underestimated their impact.
>

I thought about your comments and wrote the following patch (just for
testing) which uses a file mapping to replace the anonymous mapping.
UEFI seems to work fine. So why not use a file mapping to read or write
a pflash device? Other than this way, I don't know how to share the
pflash memory among VMs or reduce the resource consumption. :(

---8>---
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ce2664a..12c78f2 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -898,6 +898,7 @@ static void create_one_flash(const char *name, hwaddr flashbase,
     qdev_prop_set_uint16(dev, "id2", 0x00);
     qdev_prop_set_uint16(dev, "id3", 0x00);
     qdev_prop_set_string(dev, "name", name);
+    qdev_prop_set_bit(dev, "prealloc", false);
     qdev_init_nofail(dev);

     memory_region_add_subregion(sysmem, flashbase,
diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 16dfae1..23a85bc 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -92,6 +92,7 @@ struct PFlashCFI01 {
     void *storage;
     VMChangeStateEntry *vmstate;
     bool old_multiple_chip_handling;
+    bool prealloc;
 };

 static int pflash_post_load(void *opaque, int version_id);
@@ -731,11 +732,21 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp)
     }
     device_len = sector_len_per_device * blocks_per_device;

-    memory_region_init_rom_device(
-        &pfl->mem, OBJECT(dev),
-        &pflash_cfi01_ops,
-        pfl,
-        pfl->name, total_len, &local_err);
+    if (pfl->blk && !pfl->prealloc) {
+        memory_region_init_rom_device_from_file(
+            &pfl->mem, OBJECT(dev),
+            &pflash_cfi01_ops,
+            pfl,
+            pfl->name, total_len,
+            blk_is_read_only(pfl->blk) ? RAM_SHARED : RAM_PMEM,
+            blk_bs(pfl->blk)->filename, &local_err);
+    } else {
+        memory_region_init_rom_device(
+            &pfl->mem, OBJECT(dev),
+            &pflash_cfi01_ops,
+            pfl,
+            pfl->name, total_len, &local_err);
+    }
     if (local_err) {
         error_propagate(errp, local_err);
         return;
@@ -899,6 +910,7 @@ static Property pflash_cfi01_properties[] = {
     DEFINE_PROP_STRING("name", PFlashCFI01, name),
     DEFINE_PROP_BOOL("old-multiple-chip-handling", PFlashCFI01,
                      old_multiple_chip_handling, false),
+    DEFINE_PROP_BOOL("prealloc", PFlashCFI01, prealloc, true),
     DEFINE_PROP_END_OF_LIST(),
 };

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1625913..1b16d3b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -804,6 +804,16 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
                                              uint64_t size,
                                              Error **errp);

+void memory_region_init_rom_device_from_file(MemoryRegion *mr,
+                                             struct Object *owner,
+                                             const MemoryRegionOps *ops,
+                                             void *opaque,
+                                             const char *name,
+                                             uint64_t size,
+                                             uint32_t ram_flags,
+                                             const char *path,
+                                             Error **errp);
+
 /**
  * memory_region_init_iommu: Initialize a memory region of a custom type
  * that translates addresses
diff --git a/memory.c b/memory.c
index 9fbca52..905956b 100644
--- a/memory.c
+++ b/memory.c
@@ -1719,6 +1719,36 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     }
 }

+void memory_region_init_rom_device_from_file(MemoryRegion *mr,
+                                             Object *owner,
+                                             const MemoryRegionOps *ops,
+                                             void *opaque,
+                                             const char *name,
+                                             uint64_t size,
+                                             uint32_t ram_flags,
+                                             const char *path,
+                                             Error **errp)
+{
+    DeviceState *owner_dev;
+    Error *err = NULL;
+
+    assert(ops);
+    memory_region_init(mr, owner, name, size);
+    mr->ops = ops;
+    mr->opaque = opaque;
+    mr->terminates = true;
+    mr->rom_device = true;
+    mr->destructor = memory_region_destructor_ram;
+    mr->ram_block = qemu_ram_alloc_from_file(size, mr, ram_flags, path, &err);
+    if (err) {
+        mr->size = int128_zero();
+        object_unparent(OBJECT(mr));
+        error_propagate(errp, err);
+        return;
+    }
+
+    owner_dev = DEVICE(owner);
+    vmstate_register_ram(mr, owner_dev);
+}
+
 void memory_region_init_iommu(void *_iommu_mr,
                               size_t instance_size,
                               const char *mrtypename,
On 04/03/19 16:12, Xiang Zheng wrote: > Hi Laszlo and Markus, > > Thanks for your useful suggestions and comments! :) > > On 2019/3/27 2:36, Markus Armbruster wrote: >> Laszlo Ersek <lersek@redhat.com> writes: >> >>> On 03/26/19 17:39, Markus Armbruster wrote: >>>> Laszlo Ersek <lersek@redhat.com> writes: >>> >>>>> With the dynamic sizing in QEMU (which, IIRC, I had originally >>>>> introduced still in the 1MB times, due to the split between the >>>>> executable and varstore parts), both the 1MB->2MB switch, and the >>>>> 2MB->4MB switch in the firmware caused zero pain in QEMU. And right now, >>>>> 4MB looks like a "sweet spot", with some elbow room left. >>>> >>>> Explicit configuration would've been exactly as painless. Even with >>>> pflash sizes restricted to powers of two. >>> >>> I wrote the patch that ended up as commit 637a5acb46b3 -- with your R-b >>> -- in 2013. I'm unsure if machine type properties existed back then, but >>> even if they did, do you think I knew about them? :) >>> >>> You are right, of course; it's just that we can't tell the future. >> >> True! All we can do is continue to design as well as we can given the >> information, experience and resources we have, and when the inevitable >> design mistakes become apparent, limit their impact. >> >> Some of the things we now consider mistakes we just didn't see. Others >> we saw (e.g. multiple pflash devices, unlike physical hardware), but >> underestimated their impact. >> > > I thought about your comments and wrote the following patch (just for test) > which uses a file mapping to replace the anonymous mapping. UEFI seems to work > fine. So why not use a file mapping to read or write a pflash device? Honestly I can't answer "why not do this". Maybe we should. Regarding "why not do this *exactly as shown below*" -- probably because then we'd have updates to the same underlying regular file via both mmap() and write(), and the interactions between those are really tricky (= best avoided). 
One of my favorite questions over the years used to be posting a matrix of possible mmap() and file descriptor r/w/sync interactions, with the outcomes as they were "implied" by POSIX, and then asking at the bottom, "is my understanding correct?" I've never ever received an answer to that. :) Also... although we don't really use them in practice, some QCOW2 features for pflash block backends are -- or would be -- nice. (Not the internal snapshots or the live RAM dumps, of course.) Thanks Laszlo > Except this way, I don't know how to share the pflash memory among VMs or > reduce the consumption for resource. :( > > ---8>--- > > diff --git a/hw/arm/virt.c b/hw/arm/virt.c > index ce2664a..12c78f2 100644 > --- a/hw/arm/virt.c > +++ b/hw/arm/virt.c > @@ -898,6 +898,7 @@ static void create_one_flash(const char *name, hwaddr flashbase, > qdev_prop_set_uint16(dev, "id2", 0x00); > qdev_prop_set_uint16(dev, "id3", 0x00); > qdev_prop_set_string(dev, "name", name); > + qdev_prop_set_bit(dev, "prealloc", false); > qdev_init_nofail(dev); > > memory_region_add_subregion(sysmem, flashbase, > diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c > index 16dfae1..23a85bc 100644 > --- a/hw/block/pflash_cfi01.c > +++ b/hw/block/pflash_cfi01.c > @@ -92,6 +92,7 @@ struct PFlashCFI01 { > void *storage; > VMChangeStateEntry *vmstate; > bool old_multiple_chip_handling; > + bool prealloc; > }; > > static int pflash_post_load(void *opaque, int version_id); > @@ -731,11 +732,21 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp) > } > device_len = sector_len_per_device * blocks_per_device; > > - memory_region_init_rom_device( > - &pfl->mem, OBJECT(dev), > - &pflash_cfi01_ops, > - pfl, > - pfl->name, total_len, &local_err); > + if (pfl->blk && !pfl->prealloc) { > + memory_region_init_rom_device_from_file( > + &pfl->mem, OBJECT(dev), > + &pflash_cfi01_ops, > + pfl, > + pfl->name, total_len, > + blk_is_read_only(pfl->blk) ? 
RAM_SHARED : RAM_PMEM, > + blk_bs(pfl->blk)->filename, &local_err); > + } else { > + memory_region_init_rom_device( > + &pfl->mem, OBJECT(dev), > + &pflash_cfi01_ops, > + pfl, > + pfl->name, total_len, &local_err); > + } > if (local_err) { > error_propagate(errp, local_err); > return; > @@ -899,6 +910,7 @@ static Property pflash_cfi01_properties[] = { > DEFINE_PROP_STRING("name", PFlashCFI01, name), > DEFINE_PROP_BOOL("old-multiple-chip-handling", PFlashCFI01, > old_multiple_chip_handling, false), > + DEFINE_PROP_BOOL("prealloc", PFlashCFI01, prealloc, true), > DEFINE_PROP_END_OF_LIST(), > }; > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 1625913..1b16d3b 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -804,6 +804,16 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr, > uint64_t size, > Error **errp); > > +void memory_region_init_rom_device_from_file(MemoryRegion *mr, > + struct Object *owner, > + const MemoryRegionOps *ops, > + void *opaque, > + const char *name, > + uint64_t size, > + uint32_t ram_flags, > + const char *path, > + Error **errp); > + > /** > * memory_region_init_iommu: Initialize a memory region of a custom type > * that translates addresses > diff --git a/memory.c b/memory.c > index 9fbca52..905956b 100644 > --- a/memory.c > +++ b/memory.c > @@ -1719,6 +1719,36 @@ void memory_region_init_rom_device_nomigrate(MemoryRegion *mr, > } > } > > +void memory_region_init_rom_device_from_file(MemoryRegion *mr, > + Object *owner, > + const MemoryRegionOps *ops, > + void *opaque, > + const char *name, > + uint64_t size, > + uint32_t ram_flags, > + const char *path, > + Error **errp) > +{ > + DeviceState *owner_dev; > + Error *err = NULL; > + assert(ops); > + memory_region_init(mr, owner, name, size); > + mr->ops = ops; > + mr->opaque = opaque; > + mr->terminates = true; > + mr->rom_device = true; > + mr->destructor = memory_region_destructor_ram; > + mr->ram_block = qemu_ram_alloc_from_file(size, 
mr, ram_flags, path, &err); > + if (err) { > + mr->size = int128_zero(); > + object_unparent(OBJECT(mr)); > + error_propagate(errp, err); > + return ; > + } > + owner_dev = DEVICE(owner); > + vmstate_register_ram(mr, owner_dev); > +} > + > void memory_region_init_iommu(void *_iommu_mr, > size_t instance_size, > const char *mrtypename, >
On 2019/4/3 23:35, Laszlo Ersek wrote: >> I thought about your comments and wrote the following patch (just for test) >> which uses a file mapping to replace the anonymous mapping. UEFI seems to work >> fine. So why not use a file mapping to read or write a pflash device? > Honestly I can't answer "why not do this". Maybe we should. > > Regarding "why not do this *exactly as shown below*" -- probably because > then we'd have updates to the same underlying regular file via both > mmap() and write(), and the interactions between those are really tricky > (= best avoided). > > One of my favorite questions over the years used to be posting a matrix > of possible mmap() and file descriptor r/w/sync interactions, with the > outcomes as they were "implied" by POSIX, and then asking at the bottom, > "is my understanding correct?" I've never ever received an answer to > that. :) In my opinion, it's feasible to r/w/sync the memory devices which use a block backend via mmap() and write(). > > Also... although we don't really use them in practice, some QCOW2 > features for pflash block backends are -- or would be -- nice. (Not the > internal snapshots or the live RAM dumps, of course.) Regarding sparse files, can we read actual backend size data for the read-only pflash memory as the following steps shown? 
1) Create a sparse file -- QEMU_EFI-test.raw:

   dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0

2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw:

   dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc

3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below
   patch applied:

---8>---

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..ad18d5e 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -30,7 +30,7 @@
 bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
                                  Error **errp)
 {
-    int64_t blk_len;
+    int64_t blk_len, actual_len;
     int ret;
 
     blk_len = blk_getlength(blk);
@@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    actual_len = bdrv_get_allocated_file_size(blk_bs(blk));
+    ret = blk_pread(blk, 0, buf,
+                    (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
On 04/08/19 15:43, Xiang Zheng wrote: > > On 2019/4/3 23:35, Laszlo Ersek wrote: >>> I thought about your comments and wrote the following patch (just for test) >>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work >>> fine. So why not use a file mapping to read or write a pflash device? >> Honestly I can't answer "why not do this". Maybe we should. >> >> Regarding "why not do this *exactly as shown below*" -- probably because >> then we'd have updates to the same underlying regular file via both >> mmap() and write(), and the interactions between those are really tricky >> (= best avoided). >> >> One of my favorite questions over the years used to be posting a matrix >> of possible mmap() and file descriptor r/w/sync interactions, with the >> outcomes as they were "implied" by POSIX, and then asking at the bottom, >> "is my understanding correct?" I've never ever received an answer to >> that. :) > > In my opinion, it's feasible to r/w/sync the memory devices which use a block > backend via mmap() and write(). Maybe. I think that would be a first in QEMU, and you'd likely have to describe and follow a semi-formal model between fd actions and mmap() actions. Here's the (unconfirmed) table I referred to earlier. 
+-------------+-----------------------------------------------------+
| change made | change visible via                                  |
| through     +-----------------+-------------+---------------------+
|             | MAP_SHARED      | MAP_PRIVATE | read()              |
+-------------+-----------------+-------------+---------------------+
| MAP_SHARED  | yes             | unspecified | depends on MS_SYNC, |
|             |                 |             | MS_ASYNC, or normal |
|             |                 |             | system activity     |
+-------------+-----------------+-------------+---------------------+
| MAP_PRIVATE | no              | no          | no                  |
+-------------+-----------------+-------------+---------------------+
| write()     | depends on      | unspecified | yes                 |
|             | MS_INVALIDATE,  |             |                     |
|             | or the system's |             |                     |
|             | read/write      |             |                     |
|             | consistency     |             |                     |
+-------------+-----------------+-------------+---------------------+

Presumably:

- a write() to some offset range of a regular file should be visible in
  a MAP_SHARED mapping of that range after a matching msync(...,
  MS_INVALIDATE) call;

- changes through a MAP_SHARED mapping to a regular file should be
  visible via read() after a matching msync(..., MS_SYNC) call returns.

I sent this table first in 2010 to the Austin Group mailing list, and
received no comments. Then another person (on the same list) asked
basically the same questions in 2013, to which I responded with the
above assumptions / interpretations -- and received no comments from
third parties again.

But, I'm really out of my depth here -- we're not even discussing
generic read()/write()/mmap()/munmap()/msync() interactions, but how
they would fit into the qemu block layer. And I have no idea.

>
>>
>> Also... although we don't really use them in practice, some QCOW2
>> features for pflash block backends are -- or would be -- nice. (Not the
>> internal snapshots or the live RAM dumps, of course.)
>
> Regarding sparse files, can we read actual backend size data for the read-only
> pflash memory as the following steps shown?
> > 1) Create a sparse file -- QEMU_EFI-test.raw: > dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 > > 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: > dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc > > 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below > patch applied: > > ---8>--- > > diff --git a/hw/block/block.c b/hw/block/block.c > index bf56c76..ad18d5e 100644 > --- a/hw/block/block.c > +++ b/hw/block/block.c > @@ -30,7 +30,7 @@ > bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, > Error **errp) > { > - int64_t blk_len; > + int64_t blk_len, actual_len; > int ret; > > blk_len = blk_getlength(blk); > @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, > * block device and read only on demand. > */ > assert(size <= BDRV_REQUEST_MAX_BYTES); > - ret = blk_pread(blk, 0, buf, size); > + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); > + ret = blk_pread(blk, 0, buf, > + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); > if (ret < 0) { > error_setg_errno(errp, -ret, "can't read block backend"); > return false; > This hunk looks dubious for a general helper function. It seems to assume that the holes are punched at the end of the file. Sorry, this discussion is making me uncomfortable. I don't want to ignore your questions, but I have no idea about the right answers. This really needs the attention of the block people. Laszlo
On 2019/4/9 0:14, Laszlo Ersek wrote: > On 04/08/19 15:43, Xiang Zheng wrote: >> >> On 2019/4/3 23:35, Laszlo Ersek wrote: >>>> I thought about your comments and wrote the following patch (just for test) >>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work >>>> fine. So why not use a file mapping to read or write a pflash device? >>> Honestly I can't answer "why not do this". Maybe we should. >>> >>> Regarding "why not do this *exactly as shown below*" -- probably because >>> then we'd have updates to the same underlying regular file via both >>> mmap() and write(), and the interactions between those are really tricky >>> (= best avoided). >>> >>> One of my favorite questions over the years used to be posting a matrix >>> of possible mmap() and file descriptor r/w/sync interactions, with the >>> outcomes as they were "implied" by POSIX, and then asking at the bottom, >>> "is my understanding correct?" I've never ever received an answer to >>> that. :) >> >> In my opinion, it's feasible to r/w/sync the memory devices which use a block >> backend via mmap() and write(). > > Maybe. I think that would be a first in QEMU, and you'd likely have to > describe and follow a semi-formal model between fd actions and mmap() > actions. > > Here's the (unconfirmed) table I referred to earlier. 
> > +-------------+-----------------------------------------------------+ > | change made | change visible via | > | through +-----------------+-------------+---------------------+ > | | MAP_SHARED | MAP_PRIVATE | read() | > +-------------+-----------------+-------------+---------------------+ > | MAP_SHARED | yes | unspecified | depends on MS_SYNC, | > | | | | MS_ASYNC, or normal | > | | | | system activity | > +-------------+-----------------+-------------+---------------------+ > | MAP_PRIVATE | no | no | no | > +-------------+-----------------+-------------+---------------------+ > | write() | depends on | unspecified | yes | > | | MS_INVALIDATE, | | | > | | or the system's | | | > | | read/write | | | > | | consistency | | | > +-------------+-----------------+-------------+---------------------+ > > Presumably: > > - a write() to some offset range of a regular file should be visible in > a MAP_SHARED mapping of that range after a matching msync(..., > MS_INVALIDATE) call; > > - changes through a MAP_SHARED mapping to a regular file should be > visible via read() after a matching msync(..., MS_SYNC) call returns. > > I sent this table first in 2010 to the Austin Group mailing list, and > received no comments. Then another person (on the same list) asked > basically the same questions in 2013, to which I responded with the > above assumptions / interpretations -- and received no comments from > third parties again. > > But, I'm really out of my depth here -- we're not even discussing > generic read()/write()/mmap()/munmap()/msync() interactions, but how > they would fit into the qemu block layer. And I have no idea. > >> >>> >>> Also... although we don't really use them in practice, some QCOW2 >>> features for pflash block backends are -- or would be -- nice. (Not the >>> internal snapshots or the live RAM dumps, of course.) >> >> Regarding sparse files, can we read actual backend size data for the read-only >> pflash memory as the following steps shown? 
>> >> 1) Create a sparse file -- QEMU_EFI-test.raw: >> dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 >> >> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: >> dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc >> >> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below >> patch applied: >> >> ---8>--- >> >> diff --git a/hw/block/block.c b/hw/block/block.c >> index bf56c76..ad18d5e 100644 >> --- a/hw/block/block.c >> +++ b/hw/block/block.c >> @@ -30,7 +30,7 @@ >> bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >> Error **errp) >> { >> - int64_t blk_len; >> + int64_t blk_len, actual_len; >> int ret; >> >> blk_len = blk_getlength(blk); >> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >> * block device and read only on demand. >> */ >> assert(size <= BDRV_REQUEST_MAX_BYTES); >> - ret = blk_pread(blk, 0, buf, size); >> + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); >> + ret = blk_pread(blk, 0, buf, >> + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); >> if (ret < 0) { >> error_setg_errno(errp, -ret, "can't read block backend"); >> return false; >> > > This hunk looks dubious for a general helper function. It seems to > assume that the holes are punched at the end of the file. > > Sorry, this discussion is making me uncomfortable. I don't want to > ignore your questions, but I have no idea about the right answers. This > really needs the attention of the block people. I feel deeply sorry for making you uncomfortable. It's generous of you to give me so much of your time.
László's last sentence below is "This really needs the attention of the block people." Cc'ing some. Laszlo Ersek <lersek@redhat.com> writes: > On 04/08/19 15:43, Xiang Zheng wrote: >> >> On 2019/4/3 23:35, Laszlo Ersek wrote: >>>> I thought about your comments and wrote the following patch (just for test) >>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work >>>> fine. So why not use a file mapping to read or write a pflash device? >>> Honestly I can't answer "why not do this". Maybe we should. >>> >>> Regarding "why not do this *exactly as shown below*" -- probably because >>> then we'd have updates to the same underlying regular file via both >>> mmap() and write(), and the interactions between those are really tricky >>> (= best avoided). >>> >>> One of my favorite questions over the years used to be posting a matrix >>> of possible mmap() and file descriptor r/w/sync interactions, with the >>> outcomes as they were "implied" by POSIX, and then asking at the bottom, >>> "is my understanding correct?" I've never ever received an answer to >>> that. :) >> >> In my opinion, it's feasible to r/w/sync the memory devices which use a block >> backend via mmap() and write(). > > Maybe. I think that would be a first in QEMU, and you'd likely have to > describe and follow a semi-formal model between fd actions and mmap() > actions. > > Here's the (unconfirmed) table I referred to earlier. 
> > +-------------+-----------------------------------------------------+ > | change made | change visible via | > | through +-----------------+-------------+---------------------+ > | | MAP_SHARED | MAP_PRIVATE | read() | > +-------------+-----------------+-------------+---------------------+ > | MAP_SHARED | yes | unspecified | depends on MS_SYNC, | > | | | | MS_ASYNC, or normal | > | | | | system activity | > +-------------+-----------------+-------------+---------------------+ > | MAP_PRIVATE | no | no | no | > +-------------+-----------------+-------------+---------------------+ > | write() | depends on | unspecified | yes | > | | MS_INVALIDATE, | | | > | | or the system's | | | > | | read/write | | | > | | consistency | | | > +-------------+-----------------+-------------+---------------------+ > > Presumably: > > - a write() to some offset range of a regular file should be visible in > a MAP_SHARED mapping of that range after a matching msync(..., > MS_INVALIDATE) call; > > - changes through a MAP_SHARED mapping to a regular file should be > visible via read() after a matching msync(..., MS_SYNC) call returns. > > I sent this table first in 2010 to the Austin Group mailing list, and > received no comments. Then another person (on the same list) asked > basically the same questions in 2013, to which I responded with the > above assumptions / interpretations -- and received no comments from > third parties again. > > But, I'm really out of my depth here -- we're not even discussing > generic read()/write()/mmap()/munmap()/msync() interactions, but how > they would fit into the qemu block layer. And I have no idea. > >> >>> >>> Also... although we don't really use them in practice, some QCOW2 >>> features for pflash block backends are -- or would be -- nice. (Not the >>> internal snapshots or the live RAM dumps, of course.) >> >> Regarding sparse files, can we read actual backend size data for the read-only >> pflash memory as the following steps shown? 
>> >> 1) Create a sparse file -- QEMU_EFI-test.raw: >> dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 >> >> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: >> dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc >> >> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below >> patch applied: >> >> ---8>--- >> >> diff --git a/hw/block/block.c b/hw/block/block.c >> index bf56c76..ad18d5e 100644 >> --- a/hw/block/block.c >> +++ b/hw/block/block.c >> @@ -30,7 +30,7 @@ >> bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >> Error **errp) >> { >> - int64_t blk_len; >> + int64_t blk_len, actual_len; >> int ret; >> >> blk_len = blk_getlength(blk); >> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >> * block device and read only on demand. >> */ >> assert(size <= BDRV_REQUEST_MAX_BYTES); >> - ret = blk_pread(blk, 0, buf, size); >> + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); >> + ret = blk_pread(blk, 0, buf, >> + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); >> if (ret < 0) { >> error_setg_errno(errp, -ret, "can't read block backend"); >> return false; >> > > This hunk looks dubious for a general helper function. It seems to > assume that the holes are punched at the end of the file. > > Sorry, this discussion is making me uncomfortable. I don't want to > ignore your questions, but I have no idea about the right answers. This > really needs the attention of the block people.
Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben: > László's last sentence below is "This really needs the attention of the > block people." Cc'ing some. > > Laszlo Ersek <lersek@redhat.com> writes: > > > On 04/08/19 15:43, Xiang Zheng wrote: > >> > >> On 2019/4/3 23:35, Laszlo Ersek wrote: > >>>> I thought about your comments and wrote the following patch (just for test) > >>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work > >>>> fine. So why not use a file mapping to read or write a pflash device? > >>> Honestly I can't answer "why not do this". Maybe we should. > >>> > >>> Regarding "why not do this *exactly as shown below*" -- probably because > >>> then we'd have updates to the same underlying regular file via both > >>> mmap() and write(), and the interactions between those are really tricky > >>> (= best avoided). > >>> > >>> One of my favorite questions over the years used to be posting a matrix > >>> of possible mmap() and file descriptor r/w/sync interactions, with the > >>> outcomes as they were "implied" by POSIX, and then asking at the bottom, > >>> "is my understanding correct?" I've never ever received an answer to > >>> that. :) > >> > >> In my opinion, it's feasible to r/w/sync the memory devices which use a block > >> backend via mmap() and write(). > > > > Maybe. I think that would be a first in QEMU, and you'd likely have to > > describe and follow a semi-formal model between fd actions and mmap() > > actions. > > > > Here's the (unconfirmed) table I referred to earlier. 
> > > > +-------------+-----------------------------------------------------+ > > | change made | change visible via | > > | through +-----------------+-------------+---------------------+ > > | | MAP_SHARED | MAP_PRIVATE | read() | > > +-------------+-----------------+-------------+---------------------+ > > | MAP_SHARED | yes | unspecified | depends on MS_SYNC, | > > | | | | MS_ASYNC, or normal | > > | | | | system activity | > > +-------------+-----------------+-------------+---------------------+ > > | MAP_PRIVATE | no | no | no | > > +-------------+-----------------+-------------+---------------------+ > > | write() | depends on | unspecified | yes | > > | | MS_INVALIDATE, | | | > > | | or the system's | | | > > | | read/write | | | > > | | consistency | | | > > +-------------+-----------------+-------------+---------------------+ > > > > Presumably: > > > > - a write() to some offset range of a regular file should be visible in > > a MAP_SHARED mapping of that range after a matching msync(..., > > MS_INVALIDATE) call; > > > > - changes through a MAP_SHARED mapping to a regular file should be > > visible via read() after a matching msync(..., MS_SYNC) call returns. > > > > I sent this table first in 2010 to the Austin Group mailing list, and > > received no comments. Then another person (on the same list) asked > > basically the same questions in 2013, to which I responded with the > > above assumptions / interpretations -- and received no comments from > > third parties again. > > > > But, I'm really out of my depth here -- we're not even discussing > > generic read()/write()/mmap()/munmap()/msync() interactions, but how > > they would fit into the qemu block layer. And I have no idea. There is no infrastructure in the block layer for mmapping image files. > >>> Also... although we don't really use them in practice, some QCOW2 > >>> features for pflash block backends are -- or would be -- nice. 
(Not the > >>> internal snapshots or the live RAM dumps, of course.) > >> > >> Regarding sparse files, can we read actual backend size data for the read-only > >> pflash memory as the following steps shown? > >> > >> 1) Create a sparse file -- QEMU_EFI-test.raw: > >> dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 > >> > >> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: > >> dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc > >> > >> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below > >> patch applied: > >> > >> ---8>--- > >> > >> diff --git a/hw/block/block.c b/hw/block/block.c > >> index bf56c76..ad18d5e 100644 > >> --- a/hw/block/block.c > >> +++ b/hw/block/block.c > >> @@ -30,7 +30,7 @@ > >> bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, > >> Error **errp) > >> { > >> - int64_t blk_len; > >> + int64_t blk_len, actual_len; > >> int ret; > >> > >> blk_len = blk_getlength(blk); > >> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, > >> * block device and read only on demand. > >> */ > >> assert(size <= BDRV_REQUEST_MAX_BYTES); > >> - ret = blk_pread(blk, 0, buf, size); > >> + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); > >> + ret = blk_pread(blk, 0, buf, > >> + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); > >> if (ret < 0) { > >> error_setg_errno(errp, -ret, "can't read block backend"); > >> return false; > >> > > > > This hunk looks dubious for a general helper function. It seems to > > assume that the holes are punched at the end of the file. > > > > Sorry, this discussion is making me uncomfortable. I don't want to > > ignore your questions, but I have no idea about the right answers. This > > really needs the attention of the block people. Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even implemented in all block drivers, and when it is implemented it isn't necessarily reliable (e.g. 
return 0 for block devices). It's fine for 'qemu-img info', but that's it. But even if you actually get the correct allocated file size, that's the disk space used on the host filesystem, so it doesn't include any holes, but it does include image format metadata, the data could possibly be compressed etc. In other words, for a guest device it's completely meaningless. I'm not even sure why you would want to do this even in your special case of a raw image that is sparse only at the end. Is it an optimisation to avoid reading zeros? This ends up basically being memset() for sparse blocks, so very quick. And I think you do want to zero out the remaining part of the buffer anyway. So it looks to me as if it were complicating the code for no use. Kevin
Hi Kevin, On 2019/4/9 16:28, Kevin Wolf wrote: > Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben: >> László's last sentence below is "This really needs the attention of the >> block people." Cc'ing some. >> >> Laszlo Ersek <lersek@redhat.com> writes: >> >>> On 04/08/19 15:43, Xiang Zheng wrote: >>>> >>>> On 2019/4/3 23:35, Laszlo Ersek wrote: >>>>>> I thought about your comments and wrote the following patch (just for test) >>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work >>>>>> fine. So why not use a file mapping to read or write a pflash device? >>>>> Honestly I can't answer "why not do this". Maybe we should. >>>>> >>>>> Regarding "why not do this *exactly as shown below*" -- probably because >>>>> then we'd have updates to the same underlying regular file via both >>>>> mmap() and write(), and the interactions between those are really tricky >>>>> (= best avoided). >>>>> >>>>> One of my favorite questions over the years used to be posting a matrix >>>>> of possible mmap() and file descriptor r/w/sync interactions, with the >>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom, >>>>> "is my understanding correct?" I've never ever received an answer to >>>>> that. :) >>>> >>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block >>>> backend via mmap() and write(). >>> >>> Maybe. I think that would be a first in QEMU, and you'd likely have to >>> describe and follow a semi-formal model between fd actions and mmap() >>> actions. >>> >>> Here's the (unconfirmed) table I referred to earlier. 
>>> >>> +-------------+-----------------------------------------------------+ >>> | change made | change visible via | >>> | through +-----------------+-------------+---------------------+ >>> | | MAP_SHARED | MAP_PRIVATE | read() | >>> +-------------+-----------------+-------------+---------------------+ >>> | MAP_SHARED | yes | unspecified | depends on MS_SYNC, | >>> | | | | MS_ASYNC, or normal | >>> | | | | system activity | >>> +-------------+-----------------+-------------+---------------------+ >>> | MAP_PRIVATE | no | no | no | >>> +-------------+-----------------+-------------+---------------------+ >>> | write() | depends on | unspecified | yes | >>> | | MS_INVALIDATE, | | | >>> | | or the system's | | | >>> | | read/write | | | >>> | | consistency | | | >>> +-------------+-----------------+-------------+---------------------+ >>> >>> Presumably: >>> >>> - a write() to some offset range of a regular file should be visible in >>> a MAP_SHARED mapping of that range after a matching msync(..., >>> MS_INVALIDATE) call; >>> >>> - changes through a MAP_SHARED mapping to a regular file should be >>> visible via read() after a matching msync(..., MS_SYNC) call returns. >>> >>> I sent this table first in 2010 to the Austin Group mailing list, and >>> received no comments. Then another person (on the same list) asked >>> basically the same questions in 2013, to which I responded with the >>> above assumptions / interpretations -- and received no comments from >>> third parties again. >>> >>> But, I'm really out of my depth here -- we're not even discussing >>> generic read()/write()/mmap()/munmap()/msync() interactions, but how >>> they would fit into the qemu block layer. And I have no idea. > > There is no infrastructure in the block layer for mmapping image files. > >>>>> Also... although we don't really use them in practice, some QCOW2 >>>>> features for pflash block backends are -- or would be -- nice. 
(Not the >>>>> internal snapshots or the live RAM dumps, of course.) >>>> >>>> Regarding sparse files, can we read actual backend size data for the read-only >>>> pflash memory as the following steps shown? >>>> >>>> 1) Create a sparse file -- QEMU_EFI-test.raw: >>>> dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 >>>> >>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: >>>> dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc >>>> >>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below >>>> patch applied: >>>> >>>> ---8>--- >>>> >>>> diff --git a/hw/block/block.c b/hw/block/block.c >>>> index bf56c76..ad18d5e 100644 >>>> --- a/hw/block/block.c >>>> +++ b/hw/block/block.c >>>> @@ -30,7 +30,7 @@ >>>> bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >>>> Error **errp) >>>> { >>>> - int64_t blk_len; >>>> + int64_t blk_len, actual_len; >>>> int ret; >>>> >>>> blk_len = blk_getlength(blk); >>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >>>> * block device and read only on demand. >>>> */ >>>> assert(size <= BDRV_REQUEST_MAX_BYTES); >>>> - ret = blk_pread(blk, 0, buf, size); >>>> + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); >>>> + ret = blk_pread(blk, 0, buf, >>>> + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); >>>> if (ret < 0) { >>>> error_setg_errno(errp, -ret, "can't read block backend"); >>>> return false; >>>> >>> >>> This hunk looks dubious for a general helper function. It seems to >>> assume that the holes are punched at the end of the file. >>> >>> Sorry, this discussion is making me uncomfortable. I don't want to >>> ignore your questions, but I have no idea about the right answers. This >>> really needs the attention of the block people. > > Yes, this code is wrong. 
bdrv_get_allocated_file_size() isn't even
> implemented in all block drivers, and when it is implemented it isn't
> necessarily reliable (e.g. return 0 for block devices). It's fine for
> 'qemu-img info', but that's it.
>
> But even if you actually get the correct allocated file size, that's the
> disk space used on the host filesystem, so it doesn't include any holes,
> but it does include image format metadata, the data could possibly be
> compressed etc. In other words, for a guest device it's completely
> meaningless.
>
> I'm not even sure why you would want to do this even in your special
> case of a raw image that is sparse only at the end. Is it an
> optimisation to avoid reading zeros? This ends up basically being
> memset() for sparse blocks, so very quick. And I think you do want to
> zero out the remaining part of the buffer anyway. So it looks to me as
> if it were complicating the code for no use.

The primary target of this discussion is to save the memory allocated
for the UEFI firmware device. On a virtual machine, we split it into two
flash devices[1] -- one is read-only and the other is read-write, each
64MB in size. The qemu commandline is:

    -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
    -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \

Regarding the read-only one, which is created from QEMU_EFI.fd: the original
QEMU_EFI.fd is only 2MB in size, and we need to pad it to 64MB with 62MB of
'zeros':

    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

For some historical reasons such as compatibility and extensibility[2], we
restrict both their sizes to 64MB -- both UEFI and qemu have cold hard
constants.

This will consume a large amount of memory when running multiple VMs
simultaneously (each guest needs 128MB).

There are accepted ideas proposed by Markus and Laszlo from the prior
discussion[3]:

1) Getting flash memory size from a machine type or property.
2) Map the read-only part so it is shared among guests; map the read-write
part normally.

The first idea, explicit configuration, may break migration.

For the second idea, the only way I can imagine is using a file mapping to
read or write pflash devices so that we can allocate memory on demand. So I
tried to fit mmap() actions into the block layer[4]. However, I am not sure
whether mapping an image file can work well with the block layer.

On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
files which may be nice for pflash block backends. But I am not familiar with
it.

Till now I still have no idea how to get rid of this problem (more properly,
it's an optimization).


[1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
[2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
[3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
[4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html
Xiang Zheng <zhengxiang9@huawei.com> writes: > Hi Kevin, > > On 2019/4/9 16:28, Kevin Wolf wrote: >> Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben: >>> László's last sentence below is "This really needs the attention of the >>> block people." Cc'ing some. >>> >>> Laszlo Ersek <lersek@redhat.com> writes: >>> >>>> On 04/08/19 15:43, Xiang Zheng wrote: >>>>> >>>>> On 2019/4/3 23:35, Laszlo Ersek wrote: >>>>>>> I thought about your comments and wrote the following patch (just for test) >>>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work >>>>>>> fine. So why not use a file mapping to read or write a pflash device? >>>>>> Honestly I can't answer "why not do this". Maybe we should. >>>>>> >>>>>> Regarding "why not do this *exactly as shown below*" -- probably because >>>>>> then we'd have updates to the same underlying regular file via both >>>>>> mmap() and write(), and the interactions between those are really tricky >>>>>> (= best avoided). >>>>>> >>>>>> One of my favorite questions over the years used to be posting a matrix >>>>>> of possible mmap() and file descriptor r/w/sync interactions, with the >>>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom, >>>>>> "is my understanding correct?" I've never ever received an answer to >>>>>> that. :) >>>>> >>>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block >>>>> backend via mmap() and write(). >>>> >>>> Maybe. I think that would be a first in QEMU, and you'd likely have to >>>> describe and follow a semi-formal model between fd actions and mmap() >>>> actions. >>>> >>>> Here's the (unconfirmed) table I referred to earlier. 
>>>> >>>> +-------------+-----------------------------------------------------+ >>>> | change made | change visible via | >>>> | through +-----------------+-------------+---------------------+ >>>> | | MAP_SHARED | MAP_PRIVATE | read() | >>>> +-------------+-----------------+-------------+---------------------+ >>>> | MAP_SHARED | yes | unspecified | depends on MS_SYNC, | >>>> | | | | MS_ASYNC, or normal | >>>> | | | | system activity | >>>> +-------------+-----------------+-------------+---------------------+ >>>> | MAP_PRIVATE | no | no | no | >>>> +-------------+-----------------+-------------+---------------------+ >>>> | write() | depends on | unspecified | yes | >>>> | | MS_INVALIDATE, | | | >>>> | | or the system's | | | >>>> | | read/write | | | >>>> | | consistency | | | >>>> +-------------+-----------------+-------------+---------------------+ >>>> >>>> Presumably: >>>> >>>> - a write() to some offset range of a regular file should be visible in >>>> a MAP_SHARED mapping of that range after a matching msync(..., >>>> MS_INVALIDATE) call; >>>> >>>> - changes through a MAP_SHARED mapping to a regular file should be >>>> visible via read() after a matching msync(..., MS_SYNC) call returns. >>>> >>>> I sent this table first in 2010 to the Austin Group mailing list, and >>>> received no comments. Then another person (on the same list) asked >>>> basically the same questions in 2013, to which I responded with the >>>> above assumptions / interpretations -- and received no comments from >>>> third parties again. >>>> >>>> But, I'm really out of my depth here -- we're not even discussing >>>> generic read()/write()/mmap()/munmap()/msync() interactions, but how >>>> they would fit into the qemu block layer. And I have no idea. >> >> There is no infrastructure in the block layer for mmapping image files. >> >>>>>> Also... although we don't really use them in practice, some QCOW2 >>>>>> features for pflash block backends are -- or would be -- nice. 
(Not the >>>>>> internal snapshots or the live RAM dumps, of course.) >>>>> >>>>> Regarding sparse files, can we read actual backend size data for the read-only >>>>> pflash memory as the following steps shown? >>>>> >>>>> 1) Create a sparse file -- QEMU_EFI-test.raw: >>>>> dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 >>>>> >>>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: >>>>> dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc >>>>> >>>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below >>>>> patch applied: >>>>> >>>>> ---8>--- >>>>> >>>>> diff --git a/hw/block/block.c b/hw/block/block.c >>>>> index bf56c76..ad18d5e 100644 >>>>> --- a/hw/block/block.c >>>>> +++ b/hw/block/block.c >>>>> @@ -30,7 +30,7 @@ >>>>> bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >>>>> Error **errp) >>>>> { >>>>> - int64_t blk_len; >>>>> + int64_t blk_len, actual_len; >>>>> int ret; >>>>> >>>>> blk_len = blk_getlength(blk); >>>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, >>>>> * block device and read only on demand. >>>>> */ >>>>> assert(size <= BDRV_REQUEST_MAX_BYTES); >>>>> - ret = blk_pread(blk, 0, buf, size); >>>>> + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); >>>>> + ret = blk_pread(blk, 0, buf, >>>>> + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); >>>>> if (ret < 0) { >>>>> error_setg_errno(errp, -ret, "can't read block backend"); >>>>> return false; >>>>> >>>> >>>> This hunk looks dubious for a general helper function. It seems to >>>> assume that the holes are punched at the end of the file. >>>> >>>> Sorry, this discussion is making me uncomfortable. I don't want to >>>> ignore your questions, but I have no idea about the right answers. This >>>> really needs the attention of the block people. >> >> Yes, this code is wrong. 
bdrv_get_allocated_file_size() isn't even >> implemented in all block drivers, and when it is implemented it isn't >> necessarily reliable (e.g. return 0 for block devices). It's fine for >> 'qemu-img info', but that's it. >> >> But even if you actually get the correct allocated file size, that's the >> disk space used on the host filesystem, so it doesn't include any holes, >> but it does include image format metadata, the data could possibly be >> compressed etc. In other words, for a guest device it's completely >> meaningless. >> >> I'm not even sure why you would want to do this even in your special >> case of a raw image that is sparse only at the end. Is it an >> optimisation to avoid reading zeros? This ends up basically being >> memset() for sparse blocks, so very quick. And I think you do want to >> zero out the remaining part of the buffer anyway. So it looks to me as >> if it were complicating the code for no use. > > The primary target of this discussion is to save the memory allocated > for UEFI firmware device. On virtual machine, we split it into two flash > devices[1] -- one is read-only in 64MB size and another is read-write in > 64MB size. The qemu commandline is: > > -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \ > -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \ > > Regarding the read-only one which are created from QEMU_EFI.fd, the original > QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB > 'zeros': > > dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64 > dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc > > For some historical reasons such as compatibility and extensibility[2], we > restrict both their size to 64MB -- both UEFI and qemu have cold hard > constants. These reasons aren't historical. But they're valid, and that's all that matters :) > This will consume a large amount of memory when running multiple VM > simultaneously (each guest needs 128MB). Understood. 
> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]: > > 1) Getting flash memory size from a machine type or property. > > 2) Map the read-only part so it is shared among guests; Map the read-write > part normally. > > The first idea called explicit configuration which may break migration. For any configuration that can break migration, the solution is to have identical configuration on source and destination. > For the second idea, the only way I can imagine is using a file mapping to > read or write pflash devices so that we can allocate memory on demand. So I > tried to fit mmap() actions into the block layer[4]. However I am not sure > that maping image file whether can work fine for block layer. It adds an odd special case to the block layer. But then flash devices are an odd user of the block layer. Normal users use the block layer as a *block* layer: to read and write blocks. Our flash devices don't do that. They are *memory* devices, not *block* devices. Pretty much the only thing the two have in common is persistence of data. The block layer provides persistence, and I figure that's why it got pressed into service here. What the flash devices do is keep the complete image contents in memory, with changes written through to disk. This is pretty much exactly what mmap() does, so using it suggests itself. It's very much *not* what the block layer does. mmap() is not even possible for anything but a file backend. With mmap(), memory gets faulted in on demand, and sharing read-only memory with other VMs should be feasible. The question is how to use it here. Is there a sane way to use it without rewriting the flash devices not to use the block layer? > On the other hand, Laszlo had mentioned the features of QCOW2 and sparse > files which may be nice for pflash block backends. But I am not familiar with > it. Offhand, two QCOW2 features come to mind: 1. Sparse storage even when the storage substrate can't do sparse. 
I consider this one largely irrelevant. Any decent file system can do
sparse. Archival and network transfer can compress instead.

2. Snapshots to go with disk snapshots. *Live* snapshots (the ones that
include memory) already have the flash contents in the memory part, so
this seems only relevant for "dead" snapshots.

Anything else?

> Till now I still have no idea to get rid of this problem (more properly it's
> a optimization).
>
>
> [1]https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg06606.html
> [2]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07009.html
> [3]https://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg07107.html
> [4]https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00719.html
Am 10.04.2019 um 10:36 hat Xiang Zheng geschrieben: > Hi Kevin, > > On 2019/4/9 16:28, Kevin Wolf wrote: > > Am 09.04.2019 um 08:01 hat Markus Armbruster geschrieben: > >> László's last sentence below is "This really needs the attention of the > >> block people." Cc'ing some. > >> > >> Laszlo Ersek <lersek@redhat.com> writes: > >> > >>> On 04/08/19 15:43, Xiang Zheng wrote: > >>>> > >>>> On 2019/4/3 23:35, Laszlo Ersek wrote: > >>>>>> I thought about your comments and wrote the following patch (just for test) > >>>>>> which uses a file mapping to replace the anonymous mapping. UEFI seems to work > >>>>>> fine. So why not use a file mapping to read or write a pflash device? > >>>>> Honestly I can't answer "why not do this". Maybe we should. > >>>>> > >>>>> Regarding "why not do this *exactly as shown below*" -- probably because > >>>>> then we'd have updates to the same underlying regular file via both > >>>>> mmap() and write(), and the interactions between those are really tricky > >>>>> (= best avoided). > >>>>> > >>>>> One of my favorite questions over the years used to be posting a matrix > >>>>> of possible mmap() and file descriptor r/w/sync interactions, with the > >>>>> outcomes as they were "implied" by POSIX, and then asking at the bottom, > >>>>> "is my understanding correct?" I've never ever received an answer to > >>>>> that. :) > >>>> > >>>> In my opinion, it's feasible to r/w/sync the memory devices which use a block > >>>> backend via mmap() and write(). > >>> > >>> Maybe. I think that would be a first in QEMU, and you'd likely have to > >>> describe and follow a semi-formal model between fd actions and mmap() > >>> actions. > >>> > >>> Here's the (unconfirmed) table I referred to earlier. 
> >>> > >>> +-------------+-----------------------------------------------------+ > >>> | change made | change visible via | > >>> | through +-----------------+-------------+---------------------+ > >>> | | MAP_SHARED | MAP_PRIVATE | read() | > >>> +-------------+-----------------+-------------+---------------------+ > >>> | MAP_SHARED | yes | unspecified | depends on MS_SYNC, | > >>> | | | | MS_ASYNC, or normal | > >>> | | | | system activity | > >>> +-------------+-----------------+-------------+---------------------+ > >>> | MAP_PRIVATE | no | no | no | > >>> +-------------+-----------------+-------------+---------------------+ > >>> | write() | depends on | unspecified | yes | > >>> | | MS_INVALIDATE, | | | > >>> | | or the system's | | | > >>> | | read/write | | | > >>> | | consistency | | | > >>> +-------------+-----------------+-------------+---------------------+ > >>> > >>> Presumably: > >>> > >>> - a write() to some offset range of a regular file should be visible in > >>> a MAP_SHARED mapping of that range after a matching msync(..., > >>> MS_INVALIDATE) call; > >>> > >>> - changes through a MAP_SHARED mapping to a regular file should be > >>> visible via read() after a matching msync(..., MS_SYNC) call returns. > >>> > >>> I sent this table first in 2010 to the Austin Group mailing list, and > >>> received no comments. Then another person (on the same list) asked > >>> basically the same questions in 2013, to which I responded with the > >>> above assumptions / interpretations -- and received no comments from > >>> third parties again. > >>> > >>> But, I'm really out of my depth here -- we're not even discussing > >>> generic read()/write()/mmap()/munmap()/msync() interactions, but how > >>> they would fit into the qemu block layer. And I have no idea. > > > > There is no infrastructure in the block layer for mmapping image files. > > > >>>>> Also... 
although we don't really use them in practice, some QCOW2 > >>>>> features for pflash block backends are -- or would be -- nice. (Not the > >>>>> internal snapshots or the live RAM dumps, of course.) > >>>> > >>>> Regarding sparse files, can we read actual backend size data for the read-only > >>>> pflash memory as the following steps shown? > >>>> > >>>> 1) Create a sparse file -- QEMU_EFI-test.raw: > >>>> dd if=/dev/zero of=QEMU_EFI-test.raw bs=1M seek=64 count=0 > >>>> > >>>> 2) Read from QEMU_EFI.fd and write to QEMU_EFI-test.raw: > >>>> dd of="QEMU_EFI-test.raw" if="QEMU_EFI.fd" conv=notrunc > >>>> > >>>> 3) Use QEMU_EFI-test.raw as the read-only pflash and start qemu with the below > >>>> patch applied: > >>>> > >>>> ---8>--- > >>>> > >>>> diff --git a/hw/block/block.c b/hw/block/block.c > >>>> index bf56c76..ad18d5e 100644 > >>>> --- a/hw/block/block.c > >>>> +++ b/hw/block/block.c > >>>> @@ -30,7 +30,7 @@ > >>>> bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, > >>>> Error **errp) > >>>> { > >>>> - int64_t blk_len; > >>>> + int64_t blk_len, actual_len; > >>>> int ret; > >>>> > >>>> blk_len = blk_getlength(blk); > >>>> @@ -53,7 +53,9 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, > >>>> * block device and read only on demand. > >>>> */ > >>>> assert(size <= BDRV_REQUEST_MAX_BYTES); > >>>> - ret = blk_pread(blk, 0, buf, size); > >>>> + actual_len = bdrv_get_allocated_file_size(blk_bs(blk)); > >>>> + ret = blk_pread(blk, 0, buf, > >>>> + (!blk_is_read_only(blk) || actual_len < 0) ? size : actual_len); > >>>> if (ret < 0) { > >>>> error_setg_errno(errp, -ret, "can't read block backend"); > >>>> return false; > >>>> > >>> > >>> This hunk looks dubious for a general helper function. It seems to > >>> assume that the holes are punched at the end of the file. > >>> > >>> Sorry, this discussion is making me uncomfortable. 
I don't want to > >>> ignore your questions, but I have no idea about the right answers. This > >>> really needs the attention of the block people. > > > > Yes, this code is wrong. bdrv_get_allocated_file_size() isn't even > > implemented in all block drivers, and when it is implemented it isn't > > necessarily reliable (e.g. return 0 for block devices). It's fine for > > 'qemu-img info', but that's it. > > > > But even if you actually get the correct allocated file size, that's the > > disk space used on the host filesystem, so it doesn't include any holes, > > but it does include image format metadata, the data could possibly be > > compressed etc. In other words, for a guest device it's completely > > meaningless. > > > > I'm not even sure why you would want to do this even in your special > > case of a raw image that is sparse only at the end. Is it an > > optimisation to avoid reading zeros? This ends up basically being > > memset() for sparse blocks, so very quick. And I think you do want to > > zero out the remaining part of the buffer anyway. So it looks to me as > > if it were complicating the code for no use. > > The primary target of this discussion is to save the memory allocated > for UEFI firmware device. On virtual machine, we split it into two flash > devices[1] -- one is read-only in 64MB size and another is read-write in > 64MB size. The qemu commandline is: > > -drive file=QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \ > -drive file=empty_VARS.fd,if=pflash,format=raw,unit=1 \ > > Regarding the read-only one which are created from QEMU_EFI.fd, the original > QEMU_EFI.fd is only 2MB in size and we need to stuff it to 64MB with 62MB > 'zeros': > > dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64 > dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc > > For some historical reasons such as compatibility and extensibility[2], we > restrict both their size to 64MB -- both UEFI and qemu have cold hard > constants. 
> > This will consume a large amount of memory when running multiple VM > simultaneously (each guest needs 128MB). Okay, so your problem is that blk_pread() writes to the whole buffer, writing explicit zeroes for unallocated parts of the image, while you would like to leave those parts of the buffer untouched so that we don't actually allocate the memory, but can just use the shared zero page. If you just want to read the non-zero parts of the image, that can be done by using a loop that calls bdrv_block_status() and only reads from the image if the BDRV_BLOCK_ZERO bit is clear. Would this solve your problem? Kevin
On 2019/4/11 20:22, Kevin Wolf wrote:
> Okay, so your problem is that blk_pread() writes to the whole buffer,
> writing explicit zeroes for unallocated parts of the image, while you
> would like to leave those parts of the buffer untouched so that we don't
> actually allocate the memory, but can just use the shared zero page.
>
> If you just want to read the non-zero parts of the image, that can be
> done by using a loop that calls bdrv_block_status() and only reads from
> the image if the BDRV_BLOCK_ZERO bit is clear.
>
> Would this solve your problem?

Sounds good! What if the guest tries to read/write the zero parts?
On 2019/4/11 15:15, Markus Armbruster wrote: >> For some historical reasons such as compatibility and extensibility[2], we >> restrict both their size to 64MB -- both UEFI and qemu have cold hard >> constants. > These reasons aren't historical. But they're valid, and that's all that > matters :) > >> This will consume a large amount of memory when running multiple VM >> simultaneously (each guest needs 128MB). > Understood. > >> There are accepted ideas proposed by Markus and Laszlo from prior discussion[3]: >> >> 1) Getting flash memory size from a machine type or property. >> >> 2) Map the read-only part so it is shared among guests; Map the read-write >> part normally. >> >> The first idea called explicit configuration which may break migration. > For any configuration that can break migration, the solution is to have > identical configuration on source and destination. > >> For the second idea, the only way I can imagine is using a file mapping to >> read or write pflash devices so that we can allocate memory on demand. So I >> tried to fit mmap() actions into the block layer[4]. However I am not sure >> that maping image file whether can work fine for block layer. > It adds an odd special case to the block layer. But then flash devices > are an odd user of the block layer. > > Normal users use the block layer as a *block* layer: to read and write > blocks. > > Our flash devices don't do that. They are *memory* devices, not *block* > devices. Pretty much the only thing the two have in common is > persistence of data. The block layer provides persistence, and I figure > that's why it got pressed into service here. > > What the flash devices do is keep the complete image contents in memory, > with changes written through to disk. > > This is pretty much exactly what mmap() does, so using it suggests > itself. > > It's very much *not* what the block layer does. mmap() is not even > possible for anything but a file backend. 
>
> With mmap(), memory gets faulted in on demand, and sharing read-only
> memory with other VMs should be feasible. The question is how to use it
> here. Is there a sane way to use it without rewriting the flash devices
> not to use the block layer?

There are two important changes we may need to pay more attention to if we
use mmap():

1) Syncing/updating flash content from guest memory to a file backend. With
   mmap() we might not have to do the extra sync/update as the block layer
   does in pflash_write().

2) Live migration.

>
>> On the other hand, Laszlo had mentioned the features of QCOW2 and sparse
>> files which may be nice for pflash block backends. But I am not familiar with
>> it.
> Offhand, two QCOW2 features come to mind:
>
> 1. Sparse storage even when the storage substrate can't do sparse. I
> consider this one largely irrelevant. Any decent file system can do
> sparse. Archival and network transfer can compress instead.
>
> 2. Snapshots to go with disk snapshots. *Live* snapshots (the ones that
> include memory) already have the flash contents in the memory part, so
> this seems only relevant for "dead" snapshots.
>
> Anything else?
>

No more, thanks. :D

Kevin came up with a good idea which might solve my problem. See
https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg01865.html
On 2019/4/12 9:52, Xiang Zheng wrote: > On 2019/4/11 20:22, Kevin Wolf wrote: >> Okay, so your problem is that blk_pread() writes to the whole buffer, >> writing explicit zeroes for unallocated parts of the image, while you >> would like to leave those parts of the buffer untouched so that we don't >> actually allocate the memory, but can just use the shared zero page. >> >> If you just want to read the non-zero parts of the image, that can be >> done by using a loop that calls bdrv_block_status() and only reads from >> the image if the BDRV_BLOCK_ZERO bit is clear. >> >> Would this solve your problem? > > Sounds good! What if guest tried to read/write the zero parts? > I wrote the below patch (refer to bdrv_make_zero()) for test, it seems that everything is OK and the memory is also exactly allocated on demand. This requires pflash devices to use sparse files backend. Thus I have to create images like: dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 ---8>--- diff --git a/block/block-backend.c b/block/block-backend.c index f78e82a..ed8ca87 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset, flags | BDRV_REQ_ZERO_WRITE, cb, opaque); } +int blk_pread_nonzeroes(BlockBackend *blk, void *buf) +{ + int ret = bdrv_pread_nonzeroes(blk->root, buf); + return ret; +} + int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count) { int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0); diff --git a/block/io.c b/block/io.c index dfc153b..83e5ea7 100644 --- a/block/io.c +++ b/block/io.c @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset, BDRV_REQ_ZERO_WRITE | flags); } +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf) +{ + int ret; + int64_t target_size, bytes, offset = 0; + BlockDriverState *bs = 
child->bs; + + target_size = bdrv_getlength(bs); + if (target_size < 0) { + return target_size; + } + + for (;;) { + bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES); + if (bytes <= 0) { + return 0; + } + ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL); + if (ret < 0) { + return ret; + } + if (ret & BDRV_BLOCK_ZERO) { + offset += bytes; + continue; + } + ret = bdrv_pread(child, offset, buf, bytes); + if (ret < 0) { + return ret; + } + offset += bytes; + } +} + /* * Completely zero out a block device with the help of bdrv_pwrite_zeroes. * The operation is sped up by checking the block status and only writing diff --git a/hw/block/block.c b/hw/block/block.c index bf56c76..e3c67f8 100644 --- a/hw/block/block.c +++ b/hw/block/block.c @@ -53,7 +53,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size, * block device and read only on demand. */ assert(size <= BDRV_REQUEST_MAX_BYTES); - ret = blk_pread(blk, 0, buf, size); + ret = blk_pread_nonzeroes(blk, buf); if (ret < 0) { error_setg_errno(errp, -ret, "can't read block backend"); return false; diff --git a/include/block/block.h b/include/block/block.h index c7a2619..d0e06cf 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -322,6 +322,7 @@ int bdrv_write(BdrvChild *child, int64_t sector_num, const uint8_t *buf, int nb_sectors); int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset, int bytes, BdrvRequestFlags flags); +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf); int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags); int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes); int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov); diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h index 3be05c2..5d349d2 100644 --- a/include/sysemu/block-backend.h +++ b/include/sysemu/block-backend.h @@ -129,6 +129,7 @@ int blk_pwrite_zeroes(BlockBackend *blk, int64_t offset, BlockAIOCB 
*blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset, int bytes, BdrvRequestFlags flags, BlockCompletionFunc *cb, void *opaque); +int blk_pread_nonzeroes(BlockBackend *blk, void *buf); int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags); int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int bytes); int blk_pwrite(BlockBackend *blk, int64_t offset, const void *buf, int bytes,
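The dd commands above (seek=64 count=0, then conv=notrunc) leave the trailing 62MB as a hole. A standalone C sketch of the same effect, not QEMU code -- the path and the 0xAB "firmware" fill are illustrative, and the allocation bound assumes a sparse-capable filesystem:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a file whose logical size is 64MB but which only has the first
 * 2MB of "firmware" written; the rest stays a hole.  Returns the number
 * of bytes actually allocated on disk, or -1 on error. */
static long long make_sparse_flash(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }

    long long ret = -1;
    char *fw = malloc(2 << 20);                /* stand-in for QEMU_EFI.fd */
    if (fw && ftruncate(fd, 64LL << 20) == 0) {  /* dd seek=64 count=0 */
        memset(fw, 0xAB, 2 << 20);
        struct stat st;
        if (pwrite(fd, fw, 2 << 20, 0) == 2 << 20 &&  /* dd conv=notrunc */
            fsync(fd) == 0 && fstat(fd, &st) == 0) {
            ret = (long long)st.st_blocks * 512;  /* allocated, not logical */
        }
    }
    free(fw);
    close(fd);
    return ret;
}
```

With this, `stat()` reports the full 64MB logical size (what the guest device geometry would see), while the allocated size stays near 2MB -- the property the sparse-backend approach relies on.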
On 12.04.2019 at 11:50, Xiang Zheng wrote:
>
> On 2019/4/12 9:52, Xiang Zheng wrote:
> > On 2019/4/11 20:22, Kevin Wolf wrote:
> >> Okay, so your problem is that blk_pread() writes to the whole buffer,
> >> writing explicit zeroes for unallocated parts of the image, while you
> >> would like to leave those parts of the buffer untouched so that we don't
> >> actually allocate the memory, but can just use the shared zero page.
> >>
> >> If you just want to read the non-zero parts of the image, that can be
> >> done by using a loop that calls bdrv_block_status() and only reads from
> >> the image if the BDRV_BLOCK_ZERO bit is clear.
> >>
> >> Would this solve your problem?
> >
> > Sounds good! What if guest tried to read/write the zero parts?
>
> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
> that everything is OK and the memory is also exactly allocated on demand.
>
> This requires pflash devices to use sparse files backend. Thus I have to
> create images like:
>
>     dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>     dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>
>     dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>
> ---8>---
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index f78e82a..ed8ca87 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>                             flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>  }
>
> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
> +{
> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
> +    return ret;
> +}

I don't think this deserves a place in the public block layer interface,
as it's only a single device that makes use of it.

Maybe you wrote things this way because there is no blk_block_status(),
but you can get the BlockDriverState with blk_bs(blk) and then implement
everything inside hw/block/block.c.

>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>  {
>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
> diff --git a/block/io.c b/block/io.c
> index dfc153b..83e5ea7 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>                             BDRV_REQ_ZERO_WRITE | flags);
>  }
>
> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
> +{
> +    int ret;
> +    int64_t target_size, bytes, offset = 0;
> +    BlockDriverState *bs = child->bs;
> +
> +    target_size = bdrv_getlength(bs);
> +    if (target_size < 0) {
> +        return target_size;
> +    }
> +
> +    for (;;) {
> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
> +        if (bytes <= 0) {
> +            return 0;
> +        }
> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        if (ret & BDRV_BLOCK_ZERO) {
> +            offset += bytes;
> +            continue;
> +        }
> +        ret = bdrv_pread(child, offset, buf, bytes);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +        offset += bytes;

I think the code becomes simpler the other way round:

    if (!(ret & BDRV_BLOCK_ZERO)) {
        ret = bdrv_pread(child, offset, buf, bytes);
        if (ret < 0) {
            return ret;
        }
    }
    offset += bytes;

You don't increment buf, so if you have a hole in the file, this will
corrupt the buffer. You need to either increment buf, too, or use
(uint8_t*) buf + offset for the bdrv_pread() call.

> +    }
> +}
> +
>  /*
>   * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
>   * The operation is sped up by checking the block status and only writing

Kevin
On 2019/4/12 18:57, Kevin Wolf wrote:
> On 12.04.2019 at 11:50, Xiang Zheng wrote:
>>
>> On 2019/4/12 9:52, Xiang Zheng wrote:
>>> On 2019/4/11 20:22, Kevin Wolf wrote:
>>>> Okay, so your problem is that blk_pread() writes to the whole buffer,
>>>> writing explicit zeroes for unallocated parts of the image, while you
>>>> would like to leave those parts of the buffer untouched so that we don't
>>>> actually allocate the memory, but can just use the shared zero page.
>>>>
>>>> If you just want to read the non-zero parts of the image, that can be
>>>> done by using a loop that calls bdrv_block_status() and only reads from
>>>> the image if the BDRV_BLOCK_ZERO bit is clear.
>>>>
>>>> Would this solve your problem?
>>>
>>> Sounds good! What if guest tried to read/write the zero parts?
>>
>> I wrote the below patch (refer to bdrv_make_zero()) for test, it seems
>> that everything is OK and the memory is also exactly allocated on demand.
>>
>> This requires pflash devices to use sparse files backend. Thus I have to
>> create images like:
>>
>>     dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>>     dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
>>
>>     dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
>>
>> ---8>---
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index f78e82a..ed8ca87 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -1379,6 +1379,12 @@ BlockAIOCB *blk_aio_pwrite_zeroes(BlockBackend *blk, int64_t offset,
>>                             flags | BDRV_REQ_ZERO_WRITE, cb, opaque);
>>  }
>>
>> +int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
>> +{
>> +    int ret = bdrv_pread_nonzeroes(blk->root, buf);
>> +    return ret;
>> +}
>
> I don't think this deserves a place in the public block layer interface,
> as it's only a single device that makes use of it.
>
> Maybe you wrote things this way because there is no blk_block_status(),
> but you can get the BlockDriverState with blk_bs(blk) and then implement
> everything inside hw/block/block.c.

Yes, you are right.

>
>>  int blk_pread(BlockBackend *blk, int64_t offset, void *buf, int count)
>>  {
>>      int ret = blk_prw(blk, offset, buf, count, blk_read_entry, 0);
>> diff --git a/block/io.c b/block/io.c
>> index dfc153b..83e5ea7 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -882,6 +882,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>                             BDRV_REQ_ZERO_WRITE | flags);
>>  }
>>
>> +int bdrv_pread_nonzeroes(BdrvChild *child, void *buf)
>> +{
>> +    int ret;
>> +    int64_t target_size, bytes, offset = 0;
>> +    BlockDriverState *bs = child->bs;
>> +
>> +    target_size = bdrv_getlength(bs);
>> +    if (target_size < 0) {
>> +        return target_size;
>> +    }
>> +
>> +    for (;;) {
>> +        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
>> +        if (bytes <= 0) {
>> +            return 0;
>> +        }
>> +        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        if (ret & BDRV_BLOCK_ZERO) {
>> +            offset += bytes;
>> +            continue;
>> +        }
>> +        ret = bdrv_pread(child, offset, buf, bytes);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +        offset += bytes;
>
> I think the code becomes simpler the other way round:
>
>     if (!(ret & BDRV_BLOCK_ZERO)) {
>         ret = bdrv_pread(child, offset, buf, bytes);
>         if (ret < 0) {
>             return ret;
>         }
>     }
>     offset += bytes;
>
> You don't increment buf, so if you have a hole in the file, this will
> corrupt the buffer. You need to either increment buf, too, or use
> (uint8_t*) buf + offset for the bdrv_pread() call.

Yes, I didn't notice it. I think the latter is better. Does *BDRV_BLOCK_ZERO*
mean that the sector holds all-zero data, or that it is a hole? If I use an
image explicitly filled with zeroes, the BDRV_BLOCK_ZERO bit is not set on
return.

Should I resend a patch?

---8>---

From 4dbfe4955aa9fe23404cbe1890fbe148be2ff10e Mon Sep 17 00:00:00 2001
From: Xiang Zheng <zhengxiang9@huawei.com>
Date: Sat, 13 Apr 2019 02:27:03 +0800
Subject: [PATCH] pflash: Only read non-zero parts of backend image

Currently we fill the VIRT_FLASH memory space with two 64MB NOR images
when using persistent UEFI variables on the virt board. Actually we only
use a very small (non-zero) part of that memory, while the rest, a much
larger all-zero part, is wasted.

So this patch checks the block status and only reads the non-zero parts
into memory. This requires pflash devices to use sparse files as
backends.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---
 hw/block/block.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/hw/block/block.c b/hw/block/block.c
index bf56c76..3cb9d4c 100644
--- a/hw/block/block.c
+++ b/hw/block/block.c
@@ -15,6 +15,44 @@
 #include "qapi/qapi-types-block.h"

 /*
+ * Read the non-zero parts of @blk into @buf.
+ * Reading all of @blk is expensive if its zero parts are large
+ * enough. Therefore check the block status and only read the
+ * non-zero blocks into @buf.
+ *
+ * Return 0 on success, negative errno on failure.
+ */
+static int blk_pread_nonzeroes(BlockBackend *blk, void *buf)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = blk_bs(blk);
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (!(ret & BDRV_BLOCK_ZERO)) {
+            ret = bdrv_pread(bs->file, offset, (uint8_t *) buf + offset, bytes);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+        offset += bytes;
+    }
+}
+
+/*
  * Read the entire contents of @blk into @buf.
  * @blk's contents must be @size bytes, and @size must be at most
  * BDRV_REQUEST_MAX_BYTES.
@@ -53,7 +91,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
      * block device and read only on demand.
      */
     assert(size <= BDRV_REQUEST_MAX_BYTES);
-    ret = blk_pread(blk, 0, buf, size);
+    ret = blk_pread_nonzeroes(blk, buf);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "can't read block backend");
         return false;
Ping?

On 2019/4/15 10:39, Xiang Zheng wrote:
> [...]
Currently we fill the VIRT_FLASH space with two 64MB NOR images when
using persistent UEFI variables on QEMU. Actually we only use a very
small part of the memory while the rest significant large part of
memory is wasted.

This patch creates and maps a variable size of flash device instead of
a mandatory 64MB one to save memory.

Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
---

This patch might be insufficient since it also needs to modify the flash size
in ACPI and DTB.

BTW, I don't understand why it requires the two NOR images to be exactly 64MB
in size when using -pflash.

---
 hw/arm/virt.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ce2664a..8269532 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -44,6 +44,7 @@
 #include "sysemu/numa.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/kvm.h"
+#include "sysemu/block-backend.h"
 #include "hw/loader.h"
 #include "exec/address-spaces.h"
 #include "qemu/bitops.h"
@@ -881,14 +882,18 @@ static void create_one_flash(const char *name, hwaddr flashbase,
     DriveInfo *dinfo = drive_get_next(IF_PFLASH);
     DeviceState *dev = qdev_create(NULL, TYPE_PFLASH_CFI01);
     SysBusDevice *sbd = SYS_BUS_DEVICE(dev);
+    BlockBackend *blk_backend = NULL;
     const uint64_t sectorlength = 256 * 1024;
+    hwaddr real_size = flashsize;

     if (dinfo) {
-        qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo),
-                            &error_abort);
+        blk_backend = blk_by_legacy_dinfo(dinfo);
+        qdev_prop_set_drive(dev, "drive", blk_backend, &error_abort);
+        real_size = blk_getlength(blk_backend);
+        real_size = flashsize > real_size ? real_size : flashsize;
     }

-    qdev_prop_set_uint32(dev, "num-blocks", flashsize / sectorlength);
+    qdev_prop_set_uint32(dev, "num-blocks", real_size / sectorlength);
     qdev_prop_set_uint64(dev, "sector-length", sectorlength);
     qdev_prop_set_uint8(dev, "width", 4);
     qdev_prop_set_uint8(dev, "device-width", 2);