[v4,0/8] Introduce the microvm machine type

Message ID 20190924124433.96810-1-slp@redhat.com (mailing list archive)

Message

Sergio Lopez Pascual Sept. 24, 2019, 12:44 p.m. UTC
Microvm is a machine type inspired by both NEMU and Firecracker, and
constructed after the machine model implemented by the latter.

Its main purpose is providing users a minimalist machine type free
from the burden of legacy compatibility, serving as a stepping stone
for future projects aiming at improving boot times, reducing the
attack surface and slimming down QEMU's footprint.

The microvm machine type supports the following devices:

 - ISA bus
 - i8259 PIC
 - LAPIC (implicit if using KVM)
 - IOAPIC (defaults to kernel_irqchip_split = true)
 - i8254 PIT
 - MC146818 RTC (optional)
 - kvmclock (if using KVM)
 - fw_cfg
 - One ISA serial port (optional)
 - Up to eight virtio-mmio devices (configured by the user)
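
As a rough sanity check for the eight-device limit, the sketch below
counts the virtio-mmio devices on a QEMU command line. It is only a
heuristic: it assumes that devices named virtio-*-device are the ones
that occupy a virtio-mmio slot, and count_virtio_mmio is a name made up
for this example, not something QEMU provides.

```shell
# Count "-device virtio-*-device[,...]" arguments on a QEMU command line.
# Hypothetical helper; the name pattern is an assumption, not a QEMU rule.
count_virtio_mmio() {
    count=0
    prev=
    for arg in "$@"; do
        if [ "$prev" = "-device" ]; then
            # Strip any ",prop=value" suffix before matching the device name.
            case ${arg%%,*} in
                virtio-*-device) count=$((count + 1)) ;;
            esac
        fi
        prev=$arg
    done
    printf '%s\n' "$count"
}
```

Given the -device arguments virtio-serial-device, virtconsole,
virtio-blk-device and virtio-net-device, it prints 3: virtconsole sits
on the virtio-serial bus and does not match the pattern.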

It supports the following machine-specific options:

microvm.option-roms=bool (Set off to disable loading option ROMs)
microvm.isa-serial=bool (Set off to disable the instantiation of an ISA serial port)
microvm.rtc=bool (Set off to disable the instantiation of an MC146818 RTC)
microvm.kernel-cmdline=bool (Set off to disable adding virtio-mmio devices to the kernel cmdline)
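
These options are passed as comma-separated properties after the
machine name in -M. A minimal sketch of a wrapper that assembles and
validates that string follows; microvm_machine is a hypothetical helper
for illustration, and it only knows the four options listed above.

```shell
# Build the argument for -M from the microvm options listed above.
# microvm_machine is a hypothetical helper, not part of QEMU.
microvm_machine() {
    machine=microvm
    for opt in "$@"; do
        key=${opt%%=*}
        val=${opt#*=}
        case $key in
            option-roms|isa-serial|rtc|kernel-cmdline) ;;
            *) echo "unknown microvm option: $key" >&2; return 1 ;;
        esac
        case $val in
            on|off) ;;
            *) echo "$key expects on or off, got: $val" >&2; return 1 ;;
        esac
        machine="$machine,$key=$val"
    done
    printf '%s\n' "$machine"
}
```

It could then be used as
qemu-system-x86_64 -M "$(microvm_machine rtc=off isa-serial=off)"
followed by the remaining arguments.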

By default, microvm uses qboot as its BIOS, to obtain better boot
times, but it's also compatible with SeaBIOS.

As no current FW is able to boot from a block device using virtio-mmio
as its transport, a microvm-based VM needs to be run using a host-side
kernel and, optionally, an initrd image.
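
A minimal sketch of booting with a host-side kernel and an optional
initrd, using the standard -kernel and -initrd options. The helper name
run_microvm, the QEMU override variable and the file names are
assumptions made for this example, and the kernel command line would
need adjusting for a real guest.

```shell
# Start a microvm guest from a host-side kernel; the initrd is optional.
# run_microvm and the QEMU variable are hypothetical names for this sketch.
run_microvm() {
    kernel=$1
    initrd=${2:-}
    set -- -M microvm -enable-kvm -cpu host -m 512m -smp 2 \
        -kernel "$kernel" -append "console=ttyS0" \
        -nodefaults -no-user-config -nographic -serial stdio
    if [ -n "$initrd" ]; then
        # Only pass -initrd when an image was actually given.
        set -- "$@" -initrd "$initrd"
    fi
    # QEMU can be overridden, e.g. QEMU=echo to print the arguments only.
    "${QEMU:-qemu-system-x86_64}" "$@"
}
```

For instance, "run_microvm vmlinux initrd.img" boots with an initrd,
while "run_microvm vmlinux" omits it.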

This is an example of instantiating a microvm VM with a virtio-mmio
based console:

qemu-system-x86_64 -M microvm \
 -enable-kvm -cpu host -m 512m -smp 2 \
 -kernel vmlinux -append "console=hvc0 root=/dev/vda" \
 -nodefaults -no-user-config -nographic \
 -chardev stdio,id=virtiocon0,server \
 -device virtio-serial-device \
 -device virtconsole,chardev=virtiocon0 \
 -drive id=test,file=test.img,format=raw,if=none \
 -device virtio-blk-device,drive=test \
 -netdev tap,id=tap0,script=no,downscript=no \
 -device virtio-net-device,netdev=tap0

This is another example, this time using an ISA serial port, useful
for debugging purposes:

qemu-system-x86_64 -M microvm \
 -enable-kvm -cpu host -m 512m -smp 2 \
 -kernel vmlinux -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/vda" \
 -nodefaults -no-user-config -nographic \
 -serial stdio \
 -drive id=test,file=test.img,format=raw,if=none \
 -device virtio-blk-device,drive=test \
 -netdev tap,id=tap0,script=no,downscript=no \
 -device virtio-net-device,netdev=tap0

Finally, in this example a microvm VM is instantiated without RTC,
without an ISA serial port and without loading the option ROMs,
obtaining the smallest configuration:

qemu-system-x86_64 -M microvm,rtc=off,isa-serial=off,option-roms=off \
 -enable-kvm -cpu host -m 512m -smp 2 \
 -kernel vmlinux -append "console=hvc0 root=/dev/vda" \
 -nodefaults -no-user-config -nographic \
 -chardev stdio,id=virtiocon0,server \
 -device virtio-serial-device \
 -device virtconsole,chardev=virtiocon0 \
 -drive id=test,file=test.img,format=raw,if=none \
 -device virtio-blk-device,drive=test \
 -netdev tap,id=tap0,script=no,downscript=no \
 -device virtio-net-device,netdev=tap0

---

Changelog
v4:
 - This is a complete rewrite of the whole patchset, with a focus on
   reusing as much existing code as possible to ease the maintenance burden
   and making the machine type as compatible as possible by default. As
   a result, the number of lines dedicated specifically to microvm is
   383 (code lines measured by "cloc") and, with the default
   configuration, it's now able to boot both PVH ELF images and
   bzImages with either SeaBIOS or qboot.

v3:
  - Add initrd support (thanks Stefano).

v2:
  - Drop "[PATCH 1/4] hw/i386: Factorize CPU routine".
  - Simplify machine definition (thanks Eduardo).
  - Remove use of unneeded NUMA-related callbacks (thanks Eduardo).
  - Add a patch to factorize PVH-related functions.
  - Replace use of Linux's Zero Page with PVH (thanks Maran and Paolo).
  
---
Sergio Lopez (8):
  hw/i386: Factorize PVH related functions
  hw/i386: Factorize e820 related functions
  hw/virtio: Factorize virtio-mmio headers
  hw/i386: split PCMachineState deriving X86MachineState from it
  fw_cfg: add "modify" functions for all types
  roms: add microvm-bios (qboot) as binary and git submodule
  docs/microvm.txt: document the new microvm machine type
  hw/i386: Introduce the microvm machine type

 .gitmodules                      |   3 +
 default-configs/i386-softmmu.mak |   1 +
 docs/microvm.txt                 |  78 +++
 hw/acpi/cpu_hotplug.c            |  10 +-
 hw/i386/Kconfig                  |   4 +
 hw/i386/Makefile.objs            |   4 +
 hw/i386/acpi-build.c             |  31 +-
 hw/i386/amd_iommu.c              |   4 +-
 hw/i386/e820.c                   |  99 ++++
 hw/i386/e820.h                   |  11 +
 hw/i386/intel_iommu.c            |   4 +-
 hw/i386/microvm.c                | 512 +++++++++++++++++
 hw/i386/pc.c                     | 960 +++----------------------------
 hw/i386/pc_piix.c                |  48 +-
 hw/i386/pc_q35.c                 |  38 +-
 hw/i386/pc_sysfw.c               |  60 +-
 hw/i386/pvh.c                    | 113 ++++
 hw/i386/pvh.h                    |  10 +
 hw/i386/x86.c                    | 788 +++++++++++++++++++++++++
 hw/intc/ioapic.c                 |   3 +-
 hw/nvram/fw_cfg.c                |  29 +
 hw/virtio/virtio-mmio.c          |  35 +-
 include/hw/i386/microvm.h        |  80 +++
 include/hw/i386/pc.h             |  40 +-
 include/hw/i386/x86.h            |  97 ++++
 include/hw/nvram/fw_cfg.h        |  42 ++
 include/hw/virtio/virtio-mmio.h  |  60 ++
 pc-bios/bios-microvm.bin         | Bin 0 -> 65536 bytes
 roms/Makefile                    |   6 +
 roms/qboot                       |   1 +
 target/i386/kvm.c                |   1 +
 31 files changed, 2102 insertions(+), 1070 deletions(-)
 create mode 100644 docs/microvm.txt
 create mode 100644 hw/i386/e820.c
 create mode 100644 hw/i386/e820.h
 create mode 100644 hw/i386/microvm.c
 create mode 100644 hw/i386/pvh.c
 create mode 100644 hw/i386/pvh.h
 create mode 100644 hw/i386/x86.c
 create mode 100644 include/hw/i386/microvm.h
 create mode 100644 include/hw/i386/x86.h
 create mode 100644 include/hw/virtio/virtio-mmio.h
 create mode 100755 pc-bios/bios-microvm.bin
 create mode 160000 roms/qboot

Comments

Peter Maydell Sept. 24, 2019, 1:57 p.m. UTC | #1
On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
>
> Microvm is a machine type inspired by both NEMU and Firecracker, and
> constructed after the machine model implemented by the latter.
>
> It's main purpose is providing users a minimalist machine type free
> from the burden of legacy compatibility, serving as a stepping stone
> for future projects aiming at improving boot times, reducing the
> attack surface and slimming down QEMU's footprint.


>  docs/microvm.txt                 |  78 +++

I'm not sure how close to acceptance this patchset is at the
moment, so not necessarily something you need to do now,
but could new documentation in docs/ be in rst format, not
plain text, please? (Ideally also they should be in the right
manual subdirectory, but documentation of system emulation
machines at the moment is still in texinfo format, so we
don't have a subdir for it yet.)

thanks
-- PMM
Sergio Lopez Pascual Sept. 25, 2019, 5:51 a.m. UTC | #2
Peter Maydell <peter.maydell@linaro.org> writes:

> On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
>>
>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>> constructed after the machine model implemented by the latter.
>>
>> It's main purpose is providing users a minimalist machine type free
>> from the burden of legacy compatibility, serving as a stepping stone
>> for future projects aiming at improving boot times, reducing the
>> attack surface and slimming down QEMU's footprint.
>
>
>>  docs/microvm.txt                 |  78 +++
>
> I'm not sure how close to acceptance this patchset is at the
> moment, so not necessarily something you need to do now,
> but could new documentation in docs/ be in rst format, not
> plain text, please? (Ideally also they should be in the right
> manual subdirectory, but documentation of system emulation
> machines at the moment is still in texinfo format, so we
> don't have a subdir for it yet.)

Sure. What I didn't get is, should I put it in "docs/microvm.rst" or in
some other subdirectory?

Thanks,
Sergio.
David Hildenbrand Sept. 25, 2019, 7:41 a.m. UTC | #3
On 24.09.19 14:44, Sergio Lopez wrote:
> Microvm is a machine type inspired by both NEMU and Firecracker, and
> constructed after the machine model implemented by the latter.
> 
> It's main purpose is providing users a minimalist machine type free
> from the burden of legacy compatibility, serving as a stepping stone
> for future projects aiming at improving boot times, reducing the
> attack surface and slimming down QEMU's footprint.
> 
> The microvm machine type supports the following devices:
> 
>  - ISA bus
>  - i8259 PIC
>  - LAPIC (implicit if using KVM)
>  - IOAPIC (defaults to kernel_irqchip_split = true)
>  - i8254 PIT
>  - MC146818 RTC (optional)
>  - kvmclock (if using KVM)
>  - fw_cfg
>  - One ISA serial port (optional)
>  - Up to eight virtio-mmio devices (configured by the user)

So I assume also no ACPI (CPU/memory hotplug), correct?

@Pankaj, I think it would make sense to make virtio-pmem play with
virtio-mmio/microvm.
Pankaj Gupta Sept. 25, 2019, 7:58 a.m. UTC | #4
> On 24.09.19 14:44, Sergio Lopez wrote:
> > Microvm is a machine type inspired by both NEMU and Firecracker, and
> > constructed after the machine model implemented by the latter.
> > 
> > It's main purpose is providing users a minimalist machine type free
> > from the burden of legacy compatibility, serving as a stepping stone
> > for future projects aiming at improving boot times, reducing the
> > attack surface and slimming down QEMU's footprint.
> > 
> > The microvm machine type supports the following devices:
> > 
> >  - ISA bus
> >  - i8259 PIC
> >  - LAPIC (implicit if using KVM)
> >  - IOAPIC (defaults to kernel_irqchip_split = true)
> >  - i8254 PIT
> >  - MC146818 RTC (optional)
> >  - kvmclock (if using KVM)
> >  - fw_cfg
> >  - One ISA serial port (optional)
> >  - Up to eight virtio-mmio devices (configured by the user)
> 
> So I assume also no ACPI (CPU/memory hotplug), correct?
> 
> @Pankaj, I think it would make sense to make virtio-pmem play with
> virtio-mmio/microvm.

I agree. It's using a virtio-blk device over a raw image.
Similarly, or alternatively (as an experiment), we can use virtio-pmem,
which will even reduce the guest memory footprint.

Best regards,
Pankaj

> 
> --
> 
> Thanks,
> 
> David / dhildenb
>
Sergio Lopez Pascual Sept. 25, 2019, 8:10 a.m. UTC | #5
David Hildenbrand <david@redhat.com> writes:

> On 24.09.19 14:44, Sergio Lopez wrote:
>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>> constructed after the machine model implemented by the latter.
>> 
>> It's main purpose is providing users a minimalist machine type free
>> from the burden of legacy compatibility, serving as a stepping stone
>> for future projects aiming at improving boot times, reducing the
>> attack surface and slimming down QEMU's footprint.
>> 
>> The microvm machine type supports the following devices:
>> 
>>  - ISA bus
>>  - i8259 PIC
>>  - LAPIC (implicit if using KVM)
>>  - IOAPIC (defaults to kernel_irqchip_split = true)
>>  - i8254 PIT
>>  - MC146818 RTC (optional)
>>  - kvmclock (if using KVM)
>>  - fw_cfg
>>  - One ISA serial port (optional)
>>  - Up to eight virtio-mmio devices (configured by the user)
>
> So I assume also no ACPI (CPU/memory hotplug), correct?

Correct.

> @Pankaj, I think it would make sense to make virtio-pmem play with
> virtio-mmio/microvm.

That would be great. I'm also looking forward to virtio-mem (and a
hypothetical virtio-cpu) eventually gaining hotplug capabilities in
microvm.

Thanks,
Sergio.
David Hildenbrand Sept. 25, 2019, 8:16 a.m. UTC | #6
On 25.09.19 10:10, Sergio Lopez wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 24.09.19 14:44, Sergio Lopez wrote:
>>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>>> constructed after the machine model implemented by the latter.
>>>
>>> It's main purpose is providing users a minimalist machine type free
>>> from the burden of legacy compatibility, serving as a stepping stone
>>> for future projects aiming at improving boot times, reducing the
>>> attack surface and slimming down QEMU's footprint.
>>>
>>> The microvm machine type supports the following devices:
>>>
>>>  - ISA bus
>>>  - i8259 PIC
>>>  - LAPIC (implicit if using KVM)
>>>  - IOAPIC (defaults to kernel_irqchip_split = true)
>>>  - i8254 PIT
>>>  - MC146818 RTC (optional)
>>>  - kvmclock (if using KVM)
>>>  - fw_cfg
>>>  - One ISA serial port (optional)
>>>  - Up to eight virtio-mmio devices (configured by the user)
>>
>> So I assume also no ACPI (CPU/memory hotplug), correct?
> 
> Correct.
> 
>> @Pankaj, I think it would make sense to make virtio-pmem play with
>> virtio-mmio/microvm.
> 
> That would be great. I'm also looking forward for virtio-mem (and an
> hypothetical virtio-cpu) to eventually gain hotplug capabilities in
> microvm.

@Pankaj, do you have time to look into the virtio-pmem thingy? I guess
the virtio-mmio wrapper shouldn't be too hard (very similar to the
virtio-pci wrapper - luckily I insisted on making it work independently
of PCI BARs and ACPI slots ;) ). The microvm bits would be properly
setting up device memory and wiring up the hotplug handlers, similar to
what is done in the other PC machine types (maybe that comes for free?).

virtio-pmem will allow (in read-only mode) placing the rootfs on a fake
NVDIMM, as done, e.g., in Kata Containers. We might have to include the
virtio-pmem kernel module in the initramfs, which shouldn't be too hard.
Not sure what else we'll need to make virtio-pmem usable as a rootfs.

> 
> Thanks,
> Sergio.
>
Paolo Bonzini Sept. 25, 2019, 8:26 a.m. UTC | #7
On 25/09/19 10:10, Sergio Lopez wrote:
> That would be great. I'm also looking forward for virtio-mem (and an
> hypothetical virtio-cpu) to eventually gain hotplug capabilities in
> microvm.

I disagree with this.  virtio is not a silver bullet (and in fact
perhaps it's just me but I've never understood the advantages of
virtio-mem over anything else).

If you want to add hotplug to microvm, you can reuse the existing code
for CPU and memory hotplug controllers, and write drivers for them in
Linux's drivers/platform.  The drivers would basically do what the ACPI
AML tells the interpreter to do.

There is no reason to add the complexity of virtio to something as
low-level and deadlock-prone as CPU hotplug.

Paolo
Pankaj Gupta Sept. 25, 2019, 8:37 a.m. UTC | #8
> >>> Microvm is a machine type inspired by both NEMU and Firecracker, and
> >>> constructed after the machine model implemented by the latter.
> >>>
> >>> It's main purpose is providing users a minimalist machine type free
> >>> from the burden of legacy compatibility, serving as a stepping stone
> >>> for future projects aiming at improving boot times, reducing the
> >>> attack surface and slimming down QEMU's footprint.
> >>>
> >>> The microvm machine type supports the following devices:
> >>>
> >>>  - ISA bus
> >>>  - i8259 PIC
> >>>  - LAPIC (implicit if using KVM)
> >>>  - IOAPIC (defaults to kernel_irqchip_split = true)
> >>>  - i8254 PIT
> >>>  - MC146818 RTC (optional)
> >>>  - kvmclock (if using KVM)
> >>>  - fw_cfg
> >>>  - One ISA serial port (optional)
> >>>  - Up to eight virtio-mmio devices (configured by the user)
> >>
> >> So I assume also no ACPI (CPU/memory hotplug), correct?
> > 
> > Correct.
> > 
> >> @Pankaj, I think it would make sense to make virtio-pmem play with
> >> virtio-mmio/microvm.
> > 
> > That would be great. I'm also looking forward for virtio-mem (and an
> > hypothetical virtio-cpu) to eventually gain hotplug capabilities in
> > microvm.
> 
> @Pankaj, do you have time to look into the virtio-pmem thingy? I guess
> the virtio-mmio rapper shouldn't be too hard (very similar to the
> virtio-pci wrapper - luckily I insisted to make it work independently
> from PCI BARs and ACPI slots ;) ). The microvm bits would be properly
> setting up device memory and wiring up the hotplug handlers, similar as
> done in the other PC machine types (maybe that comes for free?).

Yes, I can look at it.

> 
> virtio-pmem will allow (in read-only mode) to place the rootfs on a fake
> NVDIMM, as done e.g., in kata containers. We might have to include the
> virtio-pmem kernel module in the initramfs, shouldn't  be too hard. Not
> sure what else we'll need to make virtio-pmem get used as a rootfs.

Sure, will work on it.

Thanks,
Pankaj

> 
> > 
> > Thanks,
> > Sergio.
> > 
> 
> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 
>
Sergio Lopez Pascual Sept. 25, 2019, 8:42 a.m. UTC | #9
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 25/09/19 10:10, Sergio Lopez wrote:
>> That would be great. I'm also looking forward for virtio-mem (and an
>> hypothetical virtio-cpu) to eventually gain hotplug capabilities in
>> microvm.
>
> I disagree with this.  virtio is not a silver bullet (and in fact
> perhaps it's just me but I've never understood the advantages of
> virtio-mem over anything else).
>
> If you want to add hotplug to microvm, you can reuse the existing code
> for CPU and memory hotplug controllers, and write drivers for them in
> Linux's drivers/platform.  The drivers would basically do what the ACPI
> AML tells the interpreter to do.
>
> There is no reason to add the complexity of virtio to something as
> low-level and deadlock-prone as CPU hotplug.

TBH, I haven't put much thought into this yet. I'll keep this in mind
for the future.

Thanks,
Sergio.
David Hildenbrand Sept. 25, 2019, 8:44 a.m. UTC | #10
On 25.09.19 10:26, Paolo Bonzini wrote:
> On 25/09/19 10:10, Sergio Lopez wrote:
>> That would be great. I'm also looking forward for virtio-mem (and an
>> hypothetical virtio-cpu) to eventually gain hotplug capabilities in
>> microvm.
> 
> I disagree with this.  virtio is not a silver bullet (and in fact
> perhaps it's just me but I've never understood the advantages of
> virtio-mem over anything else).

Sorry, I had to lol about "virtio-mem over anything else". No, not
starting a discussion.

> 
> If you want to add hotplug to microvm, you can reuse the existing code
> for CPU and memory hotplug controllers, and write drivers for them in
> Linux's drivers/platform.  The drivers would basically do what the ACPI
> AML tells the interpreter to do.
> 
> There is no reason to add the complexity of virtio to something as
> low-level and deadlock-prone as CPU hotplug.

I do agree with respect to CPU hotplug complexity (especially across
architectures), but thinking "outside of the wonderful x86 world", other
architectures impose limitations (e.g., no CPU unplug on s390x - at
least for now) that make something like this very interesting. But yeah,
I already expressed somewhere else my feelings about CPU hotplug.

I consider virtio the silver bullet whenever we want a mature
paravirtualized interface across architectures. And you can tell that
I'm not the only one by the huge number of virtio devices people are
crafting right now.
Gerd Hoffmann Sept. 25, 2019, 9:12 a.m. UTC | #11
Hi,

> If you want to add hotplug to microvm, you can reuse the existing code
> for CPU and memory hotplug controllers, and write drivers for them in
> Linux's drivers/platform.  The drivers would basically do what the ACPI
> AML tells the interpreter to do.

How would the linux kernel detect those devices?

I guess that wouldn't be ACPI; it seems everyone wants to avoid it[1].

So device tree on x86?  Something else?

cheers,
  Gerd

[1] Not clear to me why; some minimal ACPI tables listing our
    devices (isa-serial, fw_cfg, ...) don't look unreasonable
    to me.  We could also make virtio-mmio discoverable that way.
    Also, we could do ACPI CPU hotplug without having to write those
    Linux platform drivers.  We would need a sysbus-acpi device though,
    but given that most ACPI code is already separated out so piix and
    q35 can share it, it should not be that hard to wire up.
Paolo Bonzini Sept. 25, 2019, 9:29 a.m. UTC | #12
On 25/09/19 11:12, Gerd Hoffmann wrote:
>   Hi,
> 
>> If you want to add hotplug to microvm, you can reuse the existing code
>> for CPU and memory hotplug controllers, and write drivers for them in
>> Linux's drivers/platform.  The drivers would basically do what the ACPI
>> AML tells the interpreter to do.
> 
> How would the linux kernel detect those devices?
>
> I guess that wouldn't be ACPI, seems everyone wants avoid it[1].
> 
> So device tree on x86?  Something else?

Yes, device tree would be great.

> [1] Not clear to me why, some minimal ACPI tables listing our
>     devices (isa-serial, fw_cfg, ...) doesn't look unreasonable
>     to me.

It's not, but ACPI is dog slow and half of the boot time is cut if you
remove it.

> We could also make virtio-mmio discoverable that way.

True, but the simplest way to plumb virtio-mmio into ACPI would be
taking the device tree properties and representing them as _DSD[1].  So
at this point it's just as easy to use the device tree directly.

Paolo

[1]
https://kernel-recipes.org/en/2015/talks/representing-device-tree-peripherals-in-acpi/
David Hildenbrand Sept. 25, 2019, 9:47 a.m. UTC | #13
On 25.09.19 11:12, Gerd Hoffmann wrote:
>   Hi,
> 
>> If you want to add hotplug to microvm, you can reuse the existing code
>> for CPU and memory hotplug controllers, and write drivers for them in
>> Linux's drivers/platform.  The drivers would basically do what the ACPI
>> AML tells the interpreter to do.
> 
> How would the linux kernel detect those devices?
> 
> I guess that wouldn't be ACPI, seems everyone wants avoid it[1].
> 
> So device tree on x86?  Something else?
> 
> cheers,
>   Gerd
> 
> [1] Not clear to me why, some minimal ACPI tables listing our
>     devices (isa-serial, fw_cfg, ...) doesn't look unreasonable
>     to me.  We could also make virtio-mmio discoverable that way.
>     Also we could do acpi cpu hotplug without having to write those
>     linux platform drivers.  We would need a sysbus-acpi device though,
>     but given that most acpi code is already separated out so piix and
>     q35 can share it it should not be that hard to wire up.
> 

Just to make one thing clear, the same could be used for DIMM based
memory hotplug, too. virtio-mem is not simply exposing DIMMs to a guest
using virtio. It's even designed to co-exist with DIMM based memory hotplug.
Paolo Bonzini Sept. 25, 2019, 10:19 a.m. UTC | #14
This is a tangent, but I was a bit too harsh in my previous message (at
least it made you laugh rather than angry!) so I think I owe you an
explanation.

On 25/09/19 10:44, David Hildenbrand wrote:
> I consider virtio the silver bullet whenever we want a mature
> paravirtualized interface across architectures. And you can tell that
> I'm not the only one by the huge amount of virtio device people are
> crafting right now.

Given there are hardware implementations of virtio, I would refine that:
virtio is a silver bullet whenever we want a mature ring buffer
interface across architectures.  Being friendly to virtualization is by
now only a detail of virtio.  It is also not exclusive to virtio, for
example NVMe 1.3 has incorporated some ideas from Xen and virtio and is
also virtualization-friendly.

In turn, the ring buffer interface is great if you want to have mostly
asynchronous operation---if not, the ring buffer is just adding
complexity.  Sure, we have the luxury of abstractions and powerful
computers that hide most of the complexity, but some of it still lurks
in the form of race conditions.

So the question for virtio-mem is what makes asynchronous operation
important for memory hotplug?  If I understand the virtio-mem driver,
all interaction with the virtio device happens through a work item,
meaning that it is strictly synchronous.  At this point, you do not need
a ring buffer, you only need:

- a command register where you write the address of a command buffer.
The device will do DMA from the command block, do whatever it has to do,
DMA back the results, and trigger an interrupt.

- an interrupt mechanism.  It could be MSI, or it could be an interrupt
pending/interrupt acknowledge register if all the hardware offers is
level-triggered interrupts.

I do agree that virtio-mem's command buffer/DMA architecture is better
than the more traditional "bunch of hardware registers" architecture
that QEMU uses for its ACPI-based CPU and memory hotplug controllers.
But that's because command buffer/DMA is what actually defines a good
paravirtualized interface; virtio is a superset of that, which may not
always be a good solution.

Paolo
David Hildenbrand Sept. 25, 2019, 10:50 a.m. UTC | #15
On 25.09.19 12:19, Paolo Bonzini wrote:
> This is a tangent, but I was a bit too harsh in my previous message (at
> least it made you laugh rather than angry!) so I think I owe you an
> explanation.

It's hard to make me really angry, you have to try harder :) However,
after years of working on VMs, VM memory management and Linux MM, I've
learned that things are horribly complicated - it's not obvious, so I
can't expect everyone to know what I've learned.

> 
> On 25/09/19 10:44, David Hildenbrand wrote:
>> I consider virtio the silver bullet whenever we want a mature
>> paravirtualized interface across architectures. And you can tell that
>> I'm not the only one by the huge amount of virtio device people are
>> crafting right now.
> 
> Given there are hardware implementation of virtio, I would refine that:
> virtio is a silver bullet whenever we want a mature ring buffer
> interface across architectures.  Being friendly to virtualization is by
> now only a detail of virtio.  It is also not exclusive to virtio, for
> example NVMe 1.3 has incorporated some ideas from Xen and virtio and is
> also virtualization-friendly.
> 
> In turn, the ring buffer interface is great if you want to have mostly
> asynchronous operation---if not, the ring buffer is just adding
> complexity.  Sure, we have the luxury of abstractions and powerful
> computers that hide most of the complexity, but some of it still lurks
> in the form of race conditions.
> 
> So the question for virtio-mem is what makes asynchronous operation
> important for memory hotplug?  If I understand the virtio-mem driver,
> all interaction with the virtio device happens through a work item,
> meaning that it is strictly synchronous.  At this point, you do not need
> a ring buffer, you only need:

So, the main building pieces virtio-mem uses as of now in the virtio
infrastructure are the config space and one virtqueue.

a) A way for the host to send requests to the guest. E.g., request a
certain amount of memory to be plugged/unplugged by the guest. Done via
config space updates (e.g., similar to virtio-balloon
inflation/deflation requests).
b) A way for the guest to communicate with the host. E.g., send
plug/unplug requests to plug/unplug separate memory blocks. Done via a
virtqueue. Similar to inflation/deflation of pages in virtio-balloon.

Requests by the host via the config space are processed asynchronously
by the guest (again, similar to - say - virtio-balloon). Guest requests
are currently processed synchronously by the host.

Guest: Can I plug this block?
Host: Sorry, No can do.

Can't tell if there might be extensions (if virtio-mem ever comes to
life ;) ) that might make use of asynchronous communication. Especially,
there might be asynchronous/multiple guest->host requests at some point
(e.g., "I'm nearly out of memory, please send help").

So yes, currently we could live without the ring buffer. But the config
space and the virtqueue are real life-savers for me right now :)

> 
> - a command register where you write the address of a command buffer.
> The device will do DMA from the command block, do whatever it has to do,
> DMA back the results, and trigger an interrupt.
> 
> - an interrupt mechanism.  It could be MSI, or it could be an interrupt
> pending/interrupt acknowledge register if all the hardware offers is
> level-triggered interrupts.
> 
> I do agree that virtio-mem's command buffer/DMA architecture is better
> than the more traditional "bunch of hardware registers" architecture
> that QEMU uses for its ACPI-based CPU and memory hotplug controllers.
> But that's because command buffer/DMA is what actually defines a good
> paravirtualized interface; virtio is a superset of that that may not be
> always a good solution.
> 

I completely agree with what you say here: virtio comes with complexity,
but also with features (e.g., config space, support for multiple queues,
abstraction of transports).

Say, if I only wanted to expose a DIMM to the guest just like via ACPI,
virtio would clearly not be the right choice.
Paolo Bonzini Sept. 25, 2019, 11:24 a.m. UTC | #16
On 25/09/19 12:50, David Hildenbrand wrote:
> Can't tell if there might be extensions (if virtio-mem ever comes to
> life ;) ) that might make use of asynchronous communication. Especially,
> there might be asynchronous/multiple guest->host requests at some point
> (e.g., "I'm nearly out of memory, please send help").

Okay, this makes sense.  I'm almost sold on it. :)

Config space also makes sense, though what you really need is the config
space interrupt, rather than config space per se.

Paolo

> So yes, currently we could live without the ring buffer. But the config
> space and the virtqueue are real life-savers for me right now :)
David Hildenbrand Sept. 25, 2019, 11:32 a.m. UTC | #17
On 25.09.19 13:24, Paolo Bonzini wrote:
> On 25/09/19 12:50, David Hildenbrand wrote:
>> Can't tell if there might be extensions (if virtio-mem ever comes to
>> life ;) ) that might make use of asynchronous communication. Especially,
>> there might be asynchronous/multiple guest->host requests at some point
>> (e.g., "I'm nearly out of memory, please send help").
> 
> Okay, this makes sense.  I'm almost sold on it. :)
> 
> Config space also makes sense, though what you really need is the config
> space interrupt, rather than config space per se.
> 

Right, and feature negotiation is yet another nice-to-have thingy in the
virtio world :)

> Paolo
Philippe Mathieu-Daudé Sept. 25, 2019, 11:33 a.m. UTC | #18
On 9/25/19 7:51 AM, Sergio Lopez wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
> 
>> On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
>>>
>>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>>> constructed after the machine model implemented by the latter.
>>>
>>> It's main purpose is providing users a minimalist machine type free
>>> from the burden of legacy compatibility, serving as a stepping stone
>>> for future projects aiming at improving boot times, reducing the
>>> attack surface and slimming down QEMU's footprint.
>>
>>
>>>  docs/microvm.txt                 |  78 +++
>>
>> I'm not sure how close to acceptance this patchset is at the
>> moment, so not necessarily something you need to do now,
>> but could new documentation in docs/ be in rst format, not
>> plain text, please? (Ideally also they should be in the right
>> manual subdirectory, but documentation of system emulation
>> machines at the moment is still in texinfo format, so we
>> don't have a subdir for it yet.)
> 
> Sure. What I didn't get is, should I put it in "docs/microvm.rst" or in
> some other subdirectory?

Should we introduce docs/machines/?
Peter Maydell Sept. 25, 2019, 12:39 p.m. UTC | #19
On Wed, 25 Sep 2019 at 12:33, Philippe Mathieu-Daudé <philmd@redhat.com> wrote:
>
> On 9/25/19 7:51 AM, Sergio Lopez wrote:
> > Peter Maydell <peter.maydell@linaro.org> writes:
> >
> >> On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
> >>>
> >>> Microvm is a machine type inspired by both NEMU and Firecracker, and
> >>> constructed after the machine model implemented by the latter.
> >>>
> >>> Its main purpose is providing users a minimalist machine type free
> >>> from the burden of legacy compatibility, serving as a stepping stone
> >>> for future projects aiming at improving boot times, reducing the
> >>> attack surface and slimming down QEMU's footprint.
> >>
> >>
> >>>  docs/microvm.txt                 |  78 +++
> >>
> >> I'm not sure how close to acceptance this patchset is at the
> >> moment, so not necessarily something you need to do now,
> >> but could new documentation in docs/ be in rst format, not
> >> plain text, please? (Ideally also they should be in the right
> >> manual subdirectory, but documentation of system emulation
> >> machines at the moment is still in texinfo format, so we
> >> don't have a subdir for it yet.)
> >
> > Sure. What I didn't get is, should I put it in "docs/microvm.rst" or in
> > some other subdirectory?
>
> Should we introduce docs/machines/?

This should live in the not-yet-created docs/system (the "system emulation
user's guide"), along with much of the content currently still in
the texinfo docs. But we don't have that structure yet and won't
until we do the texinfo conversion, so I think for the moment we
have two reasonable choices:
 (1) put it in the texinfo, so it is at least shipped to
     users until we get around to doing our docs conversion
 (2) leave it in docs/microvm.rst for now (we have a bunch
     of other docs in docs/ which are basically there because
     they're also awaiting the texinfo conversion and creation
     of the docs/user and docs/system manuals)

My ideal vision of how to do documentation of individual
machines, incidentally, would be to do it via doc comments
or some other kind of structured markup in the .c files
that define the machine, so that we could automatically
collect up the docs for the machines we're building,
put them into per-architecture sections of the docs,
have autogenerated stub "this machine exists but isn't
documented yet" entries, etc. But that's not something that
we could easily do today so I don't want to block interim
improvements to our documentation just because I have some
nice theoretical idea for how it ought to work :-)

thanks
-- PMM
Christian Borntraeger Sept. 26, 2019, 7:48 a.m. UTC | #20
On 24.09.19 14:44, Sergio Lopez wrote:
> Microvm is a machine type inspired by both NEMU and Firecracker, and
> constructed after the machine model implemented by the latter.
> 
> Its main purpose is providing users a minimalist machine type free
> from the burden of legacy compatibility, serving as a stepping stone
> for future projects aiming at improving boot times, reducing the
> attack surface and slimming down QEMU's footprint.
> 
> The microvm machine type supports the following devices:
> 
>  - ISA bus
>  - i8259 PIC
>  - LAPIC (implicit if using KVM)
>  - IOAPIC (defaults to kernel_irqchip_split = true)
>  - i8254 PIT
>  - MC146818 RTC (optional)
>  - kvmclock (if using KVM)
>  - fw_cfg
>  - One ISA serial port (optional)
>  - Up to eight virtio-mmio devices (configured by the user)

Just out of curiosity. 
What is the reason for not going virtio-pci? Is the PCI bus really
that expensive and complicated?
FWIW, I do not complain. When people start using virtio-mmio more
often this would also help virtio-ccw (which I am interested in)
as this forces people to think beyond virtio-pci.
Sergio Lopez Pascual Sept. 26, 2019, 8:22 a.m. UTC | #21
Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 24.09.19 14:44, Sergio Lopez wrote:
>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>> constructed after the machine model implemented by the latter.
>> 
>> Its main purpose is providing users a minimalist machine type free
>> from the burden of legacy compatibility, serving as a stepping stone
>> for future projects aiming at improving boot times, reducing the
>> attack surface and slimming down QEMU's footprint.
>> 
>> The microvm machine type supports the following devices:
>> 
>>  - ISA bus
>>  - i8259 PIC
>>  - LAPIC (implicit if using KVM)
>>  - IOAPIC (defaults to kernel_irqchip_split = true)
>>  - i8254 PIT
>>  - MC146818 RTC (optional)
>>  - kvmclock (if using KVM)
>>  - fw_cfg
>>  - One ISA serial port (optional)
>>  - Up to eight virtio-mmio devices (configured by the user)
>
> Just out of curiosity. 
> What is the reason for not going virtio-pci? Is the PCI bus really
> that expensive and complicated?

Well, expensive is a relative term. PCI does indeed require a
significant amount of code and cycles, but that's for a good reason, as
it provides extensive bus logic allowing things like vector
configuration, hot-plug, chaining, etc.

On the other hand, MMIO lacks any kind of bus logic, as it basically
works by saying "hey, take a look at this address, there may be
something there" to the kernel, so of course it is cheaper. This makes it
ideal for microvm's aim of supporting a VM with the smallest amount of
code, but bad for almost everything else.

I don't think this means PCI is expensive. That would be the case if
there were a bus providing similar functionality while requiring less
code and cycles. And this is definitely not the case with MMIO.

In other words, I think PCI's cost is justified by its use case, while
MMIO's simplicity makes it ideal for some specific purposes (like
microvm).
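
To make the "take a look at this address" point concrete, here is a
minimal sketch of how virtio-mmio slots can be announced to a Linux guest
through the kernel command line, using the kernel's
virtio_mmio.device=<size>@<baseaddr>:<irq> parameter. The helper name, the
base address, the slot size and the IRQ numbers are illustrative
assumptions, not values taken from this series:

```python
def virtio_mmio_cmdline(base, size, first_irq, count):
    """Build kernel cmdline fragments announcing `count` virtio-mmio slots.

    Hypothetical helper for illustration only. Each slot is described by
    its register window size, its base address and its interrupt line;
    there is no bus enumeration, the guest simply probes each address.
    """
    args = []
    for i in range(count):
        addr = base + i * size  # slots are laid out back to back
        args.append(f"virtio_mmio.device={size}@{addr:#x}:{first_irq + i}")
    return " ".join(args)


# Placeholder values: two 512-byte slots starting at 0xd0000000, IRQs 5 and 6.
print(virtio_mmio_cmdline(0xd0000000, 512, 5, 2))
```

The guest learns about each device only from these three numbers, which
is why this needs so little code compared to PCI, and also why it offers
no discovery, hot-plug or vector configuration.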

Cheers,
Sergio.

> FWIW, I do not complain. When people start using virtio-mmio more
> often this would also help virtio-ccw (which I am interested in)
> as this forces people to think beyond virtio-pci.