[v4,0/8] Introduce the microvm machine type

Message ID 20190924124433.96810-1-slp@redhat.com (mailing list archive)

Message

Sergio Lopez Pascual Sept. 24, 2019, 12:44 p.m. UTC
Microvm is a machine type inspired by both NEMU and Firecracker, and
constructed after the machine model implemented by the latter.

Its main purpose is providing users a minimalist machine type free
from the burden of legacy compatibility, serving as a stepping stone
for future projects aiming at improving boot times, reducing the
attack surface and slimming down QEMU's footprint.

The microvm machine type supports the following devices:

 - ISA bus
 - i8259 PIC
 - LAPIC (implicit if using KVM)
 - IOAPIC (defaults to kernel_irqchip_split = true)
 - i8254 PIT
 - MC146818 RTC (optional)
 - kvmclock (if using KVM)
 - fw_cfg
 - One ISA serial port (optional)
 - Up to eight virtio-mmio devices (configured by the user)

It supports the following machine-specific options:

microvm.option-roms=bool (Set off to disable loading option ROMs)
microvm.isa-serial=bool (Set off to disable the instantiation of an ISA serial port)
microvm.rtc=bool (Set off to disable the instantiation of an MC146818 RTC)
microvm.kernel-cmdline=bool (Set off to disable adding virtio-mmio devices to the kernel cmdline)
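
For readers unfamiliar with the kernel-cmdline option: advertising a
virtio-mmio device on the kernel command line relies on the Linux
virtio-mmio driver's "virtio_mmio.device=<size>@<baseaddr>:<irq>"
parameter. The following is only a minimal sketch of how such a fragment
could be formatted. The helper name is hypothetical, and the size, base
address and IRQ shown in the usage note are illustrative assumptions, not
values taken from this series.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the series): format one
 * "virtio_mmio.device=<size>@<baseaddr>:<irq>" fragment in the syntax
 * accepted by the Linux virtio-mmio driver.
 *
 * Returns the number of characters written (excluding the trailing NUL),
 * or -1 if the buffer is too small.
 */
static int format_virtio_mmio_param(char *buf, size_t len,
                                    uint64_t size, uint64_t base,
                                    unsigned int irq)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "virtio_mmio.device=%" PRIu64 "@0x%" PRIx64 ":%u",
                 size, base, irq);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With the assumed values size=512, base=0xd0000000 and irq=5, the helper
produces "virtio_mmio.device=512@0xd0000000:5".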

By default, microvm uses qboot as its BIOS, to obtain better boot
times, but it's also compatible with SeaBIOS.

As no current FW is able to boot from a block device using virtio-mmio
as its transport, a microvm-based VM needs to be run using a host-side
kernel and, optionally, an initrd image.

This is an example of instantiating a microvm VM with a virtio-mmio
based console:

qemu-system-x86_64 -M microvm \
 -enable-kvm -cpu host -m 512m -smp 2 \
 -kernel vmlinux -append "console=hvc0 root=/dev/vda" \
 -nodefaults -no-user-config -nographic \
 -chardev stdio,id=virtiocon0,server \
 -device virtio-serial-device \
 -device virtconsole,chardev=virtiocon0 \
 -drive id=test,file=test.img,format=raw,if=none \
 -device virtio-blk-device,drive=test \
 -netdev tap,id=tap0,script=no,downscript=no \
 -device virtio-net-device,netdev=tap0

This is another example, this time using an ISA serial port, useful
for debugging purposes:

qemu-system-x86_64 -M microvm \
 -enable-kvm -cpu host -m 512m -smp 2 \
 -kernel vmlinux -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/vda" \
 -nodefaults -no-user-config -nographic \
 -serial stdio \
 -drive id=test,file=test.img,format=raw,if=none \
 -device virtio-blk-device,drive=test \
 -netdev tap,id=tap0,script=no,downscript=no \
 -device virtio-net-device,netdev=tap0

Finally, in this example a microvm VM is instantiated without RTC,
without an ISA serial port and without loading the option ROMs,
obtaining the smallest configuration:

qemu-system-x86_64 -M microvm,rtc=off,isa-serial=off,option-roms=off \
 -enable-kvm -cpu host -m 512m -smp 2 \
 -kernel vmlinux -append "console=hvc0 root=/dev/vda" \
 -nodefaults -no-user-config -nographic \
 -chardev stdio,id=virtiocon0,server \
 -device virtio-serial-device \
 -device virtconsole,chardev=virtiocon0 \
 -drive id=test,file=test.img,format=raw,if=none \
 -device virtio-blk-device,drive=test \
 -netdev tap,id=tap0,script=no,downscript=no \
 -device virtio-net-device,netdev=tap0

---

Changelog
v4:
 - This is a complete rewrite of the whole patchset, with a focus on
   reusing as much existing code as possible to ease the maintenance burden
   and making the machine type as compatible as possible by default. As
   a result, the number of lines dedicated specifically to microvm is
   383 (code lines measured by "cloc") and, with the default
   configuration, it's now able to boot both PVH ELF images and
   bzImages with either SeaBIOS or qboot.

v3:
  - Add initrd support (thanks Stefano).

v2:
  - Drop "[PATCH 1/4] hw/i386: Factorize CPU routine".
  - Simplify machine definition (thanks Eduardo).
  - Remove use of unneeded NUMA-related callbacks (thanks Eduardo).
  - Add a patch to factorize PVH-related functions.
  - Replace use of Linux's Zero Page with PVH (thanks Maran and Paolo).
  
---
Sergio Lopez (8):
  hw/i386: Factorize PVH related functions
  hw/i386: Factorize e820 related functions
  hw/virtio: Factorize virtio-mmio headers
  hw/i386: split PCMachineState deriving X86MachineState from it
  fw_cfg: add "modify" functions for all types
  roms: add microvm-bios (qboot) as binary and git submodule
  docs/microvm.txt: document the new microvm machine type
  hw/i386: Introduce the microvm machine type

 .gitmodules                      |   3 +
 default-configs/i386-softmmu.mak |   1 +
 docs/microvm.txt                 |  78 +++
 hw/acpi/cpu_hotplug.c            |  10 +-
 hw/i386/Kconfig                  |   4 +
 hw/i386/Makefile.objs            |   4 +
 hw/i386/acpi-build.c             |  31 +-
 hw/i386/amd_iommu.c              |   4 +-
 hw/i386/e820.c                   |  99 ++++
 hw/i386/e820.h                   |  11 +
 hw/i386/intel_iommu.c            |   4 +-
 hw/i386/microvm.c                | 512 +++++++++++++++++
 hw/i386/pc.c                     | 960 +++----------------------------
 hw/i386/pc_piix.c                |  48 +-
 hw/i386/pc_q35.c                 |  38 +-
 hw/i386/pc_sysfw.c               |  60 +-
 hw/i386/pvh.c                    | 113 ++++
 hw/i386/pvh.h                    |  10 +
 hw/i386/x86.c                    | 788 +++++++++++++++++++++++++
 hw/intc/ioapic.c                 |   3 +-
 hw/nvram/fw_cfg.c                |  29 +
 hw/virtio/virtio-mmio.c          |  35 +-
 include/hw/i386/microvm.h        |  80 +++
 include/hw/i386/pc.h             |  40 +-
 include/hw/i386/x86.h            |  97 ++++
 include/hw/nvram/fw_cfg.h        |  42 ++
 include/hw/virtio/virtio-mmio.h  |  60 ++
 pc-bios/bios-microvm.bin         | Bin 0 -> 65536 bytes
 roms/Makefile                    |   6 +
 roms/qboot                       |   1 +
 target/i386/kvm.c                |   1 +
 31 files changed, 2102 insertions(+), 1070 deletions(-)
 create mode 100644 docs/microvm.txt
 create mode 100644 hw/i386/e820.c
 create mode 100644 hw/i386/e820.h
 create mode 100644 hw/i386/microvm.c
 create mode 100644 hw/i386/pvh.c
 create mode 100644 hw/i386/pvh.h
 create mode 100644 hw/i386/x86.c
 create mode 100644 include/hw/i386/microvm.h
 create mode 100644 include/hw/i386/x86.h
 create mode 100644 include/hw/virtio/virtio-mmio.h
 create mode 100755 pc-bios/bios-microvm.bin
 create mode 160000 roms/qboot

Comments

Paolo Bonzini Sept. 24, 2019, 1:10 p.m. UTC | #1
On 24/09/19 14:44, Sergio Lopez wrote:
> +Microvm is a machine type inspired by both NEMU and Firecracker, and
> +constructed after the machine model implemented by the latter.

I would say it's inspired by Firecracker only.  The NEMU virt machine
had virtio-pci and ACPI.

> +Its main purpose is providing users a minimalist machine type free
> +from the burden of legacy compatibility,

I think this is too strong, especially if you keep the PIC and PIT. :)
Maybe just "It's a minimalist machine type without PCI support designed
for short-lived guests".

> +serving as a stepping stone
> +for future projects aiming at improving boot times, reducing the
> +attack surface and slimming down QEMU's footprint.

"Microvm also establishes a baseline for benchmarking QEMU and operating
systems, since it is optimized for both boot time and footprint".

> +The microvm machine type supports the following devices:
> +
> + - ISA bus
> + - i8259 PIC
> + - LAPIC (implicit if using KVM)
> + - IOAPIC (defaults to kernel_irqchip_split = true)
> + - i8254 PIT

Do we need the PIT?  And perhaps the PIC even?

Paolo

> + - MC146818 RTC (optional)
> + - kvmclock (if using KVM)
> + - fw_cfg
> + - One ISA serial port (optional)
> + - Up to eight virtio-mmio devices (configured by the user)
> +
Peter Maydell Sept. 24, 2019, 1:57 p.m. UTC | #2
On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
>
> Microvm is a machine type inspired by both NEMU and Firecracker, and
> constructed after the machine model implemented by the latter.
>
> Its main purpose is providing users a minimalist machine type free
> from the burden of legacy compatibility, serving as a stepping stone
> for future projects aiming at improving boot times, reducing the
> attack surface and slimming down QEMU's footprint.


>  docs/microvm.txt                 |  78 +++

I'm not sure how close to acceptance this patchset is at the
moment, so not necessarily something you need to do now,
but could new documentation in docs/ be in rst format, not
plain text, please? (Ideally also they should be in the right
manual subdirectory, but documentation of system emulation
machines at the moment is still in texinfo format, so we
don't have a subdir for it yet.)

thanks
-- PMM
Sergio Lopez Pascual Sept. 25, 2019, 5:49 a.m. UTC | #3
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 24/09/19 14:44, Sergio Lopez wrote:
>> +Microvm is a machine type inspired by both NEMU and Firecracker, and
>> +constructed after the machine model implemented by the latter.
>
> I would say it's inspired by Firecracker only.  The NEMU virt machine
> had virtio-pci and ACPI.

Actually, the NEMU reference comes from the fact that, originally,
microvm.c code was based on virt.c, but in v4 all that is already gone,
so it makes sense to remove the reference.

>> +Its main purpose is providing users a minimalist machine type free
>> +from the burden of legacy compatibility,
>
> I think this is too strong, especially if you keep the PIC and PIT. :)
> Maybe just "It's a minimalist machine type without PCI support designed
> for short-lived guests".

OK.

>> +serving as a stepping stone
>> +for future projects aiming at improving boot times, reducing the
>> +attack surface and slimming down QEMU's footprint.
>
> "Microvm also establishes a baseline for benchmarking QEMU and operating
> systems, since it is optimized for both boot time and footprint".

Well, I prefer my paragraph, but I'm good with either.

>> +The microvm machine type supports the following devices:
>> +
>> + - ISA bus
>> + - i8259 PIC
>> + - LAPIC (implicit if using KVM)
>> + - IOAPIC (defaults to kernel_irqchip_split = true)
>> + - i8254 PIT
>
> Do we need the PIT?  And perhaps the PIC even?

We need the PIT for non-KVM accel (if present with KVM and
kernel_irqchip_split = off, it basically becomes a placeholder), and the
PIC for both the PIT and the ISA serial port.

Thanks,
Sergio.

>> + - MC146818 RTC (optional)
>> + - kvmclock (if using KVM)
>> + - fw_cfg
>> + - One ISA serial port (optional)
>> + - Up to eight virtio-mmio devices (configured by the user)
>> +
Sergio Lopez Pascual Sept. 25, 2019, 5:51 a.m. UTC | #4
Peter Maydell <peter.maydell@linaro.org> writes:

> On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
>>
>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>> constructed after the machine model implemented by the latter.
>>
>> Its main purpose is providing users a minimalist machine type free
>> from the burden of legacy compatibility, serving as a stepping stone
>> for future projects aiming at improving boot times, reducing the
>> attack surface and slimming down QEMU's footprint.
>
>
>>  docs/microvm.txt                 |  78 +++
>
> I'm not sure how close to acceptance this patchset is at the
> moment, so not necessarily something you need to do now,
> but could new documentation in docs/ be in rst format, not
> plain text, please? (Ideally also they should be in the right
> manual subdirectory, but documentation of system emulation
> machines at the moment is still in texinfo format, so we
> don't have a subdir for it yet.)

Sure. What I didn't get is, should I put it in "docs/microvm.rst" or in
some other subdirectory?

Thanks,
Sergio.
David Hildenbrand Sept. 25, 2019, 7:41 a.m. UTC | #5
On 24.09.19 14:44, Sergio Lopez wrote:
> Microvm is a machine type inspired by both NEMU and Firecracker, and
> constructed after the machine model implemented by the latter.
> 
> Its main purpose is providing users a minimalist machine type free
> from the burden of legacy compatibility, serving as a stepping stone
> for future projects aiming at improving boot times, reducing the
> attack surface and slimming down QEMU's footprint.
> 
> The microvm machine type supports the following devices:
> 
>  - ISA bus
>  - i8259 PIC
>  - LAPIC (implicit if using KVM)
>  - IOAPIC (defaults to kernel_irqchip_split = true)
>  - i8254 PIT
>  - MC146818 RTC (optional)
>  - kvmclock (if using KVM)
>  - fw_cfg
>  - One ISA serial port (optional)
>  - Up to eight virtio-mmio devices (configured by the user)

So I assume also no ACPI (CPU/memory hotplug), correct?

@Pankaj, I think it would make sense to make virtio-pmem play with
virtio-mmio/microvm.
Paolo Bonzini Sept. 25, 2019, 7:57 a.m. UTC | #6
On 25/09/19 07:49, Sergio Lopez wrote:
>>> +serving as a stepping stone
>>> +for future projects aiming at improving boot times, reducing the
>>> +attack surface and slimming down QEMU's footprint.
>>
>> "Microvm also establishes a baseline for benchmarking QEMU and operating
>> systems, since it is optimized for both boot time and footprint".
> 
> Well, I prefer my paragraph, but I'm good with either.

You're right, my version sort of missed the point.  What about
s/benchmarking/benchmarking and optimizing/?

>>> +The microvm machine type supports the following devices:
>>> +
>>> + - ISA bus
>>> + - i8259 PIC
>>> + - LAPIC (implicit if using KVM)
>>> + - IOAPIC (defaults to kernel_irqchip_split = true)
>>> + - i8254 PIT
>>
>> Do we need the PIT?  And perhaps the PIC even?
> 
> We need the PIT for non-KVM accel (if present with KVM and
> kernel_irqchip_split = off, it basically becomes a placeholder)

Why?

> and the
> PIC for both the PIT and the ISA serial port.

Can't the ISA serial port work with the IOAPIC?

Paolo
Pankaj Gupta Sept. 25, 2019, 7:58 a.m. UTC | #7
> On 24.09.19 14:44, Sergio Lopez wrote:
> > Microvm is a machine type inspired by both NEMU and Firecracker, and
> > constructed after the machine model implemented by the latter.
> > 
> > Its main purpose is providing users a minimalist machine type free
> > from the burden of legacy compatibility, serving as a stepping stone
> > for future projects aiming at improving boot times, reducing the
> > attack surface and slimming down QEMU's footprint.
> > 
> > The microvm machine type supports the following devices:
> > 
> >  - ISA bus
> >  - i8259 PIC
> >  - LAPIC (implicit if using KVM)
> >  - IOAPIC (defaults to kernel_irqchip_split = true)
> >  - i8254 PIT
> >  - MC146818 RTC (optional)
> >  - kvmclock (if using KVM)
> >  - fw_cfg
> >  - One ISA serial port (optional)
> >  - Up to eight virtio-mmio devices (configured by the user)
> 
> So I assume also no ACPI (CPU/memory hotplug), correct?
> 
> @Pankaj, I think it would make sense to make virtio-pmem play with
> virtio-mmio/microvm.

I agree. It's using a virtio-blk device over a raw image.
Similarly, or alternatively (as an experiment), we can use virtio-pmem,
which will even reduce the guest memory footprint.

Best regards,
Pankaj

> 
> --
> 
> Thanks,
> 
> David / dhildenb
>
Sergio Lopez Pascual Sept. 25, 2019, 8:10 a.m. UTC | #8
David Hildenbrand <david@redhat.com> writes:

> On 24.09.19 14:44, Sergio Lopez wrote:
>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>> constructed after the machine model implemented by the latter.
>> 
>> Its main purpose is providing users a minimalist machine type free
>> from the burden of legacy compatibility, serving as a stepping stone
>> for future projects aiming at improving boot times, reducing the
>> attack surface and slimming down QEMU's footprint.
>> 
>> The microvm machine type supports the following devices:
>> 
>>  - ISA bus
>>  - i8259 PIC
>>  - LAPIC (implicit if using KVM)
>>  - IOAPIC (defaults to kernel_irqchip_split = true)
>>  - i8254 PIT
>>  - MC146818 RTC (optional)
>>  - kvmclock (if using KVM)
>>  - fw_cfg
>>  - One ISA serial port (optional)
>>  - Up to eight virtio-mmio devices (configured by the user)
>
> So I assume also no ACPI (CPU/memory hotplug), correct?

Correct.

> @Pankaj, I think it would make sense to make virtio-pmem play with
> virtio-mmio/microvm.

That would be great. I'm also looking forward to virtio-mem (and a
hypothetical virtio-cpu) eventually gaining hotplug capabilities in
microvm.

Thanks,
Sergio.
David Hildenbrand Sept. 25, 2019, 8:16 a.m. UTC | #9
On 25.09.19 10:10, Sergio Lopez wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 24.09.19 14:44, Sergio Lopez wrote:
>>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>>> constructed after the machine model implemented by the latter.
>>>
>>> Its main purpose is providing users a minimalist machine type free
>>> from the burden of legacy compatibility, serving as a stepping stone
>>> for future projects aiming at improving boot times, reducing the
>>> attack surface and slimming down QEMU's footprint.
>>>
>>> The microvm machine type supports the following devices:
>>>
>>>  - ISA bus
>>>  - i8259 PIC
>>>  - LAPIC (implicit if using KVM)
>>>  - IOAPIC (defaults to kernel_irqchip_split = true)
>>>  - i8254 PIT
>>>  - MC146818 RTC (optional)
>>>  - kvmclock (if using KVM)
>>>  - fw_cfg
>>>  - One ISA serial port (optional)
>>>  - Up to eight virtio-mmio devices (configured by the user)
>>
>> So I assume also no ACPI (CPU/memory hotplug), correct?
> 
> Correct.
> 
>> @Pankaj, I think it would make sense to make virtio-pmem play with
>> virtio-mmio/microvm.
> 
> That would be great. I'm also looking forward to virtio-mem (and a
> hypothetical virtio-cpu) eventually gaining hotplug capabilities in
> microvm.

@Pankaj, do you have time to look into the virtio-pmem thingy? I guess
the virtio-mmio wrapper shouldn't be too hard (very similar to the
virtio-pci wrapper - luckily I insisted on making it work independently
of PCI BARs and ACPI slots ;) ). The microvm bits would be properly
setting up device memory and wiring up the hotplug handlers, similar to
what is done in the other PC machine types (maybe that comes for free?).

virtio-pmem will allow placing the rootfs (in read-only mode) on a fake
NVDIMM, as done e.g. in Kata Containers. We might have to include the
virtio-pmem kernel module in the initramfs, which shouldn't be too hard.
Not sure what else we'll need to make virtio-pmem get used as a rootfs.

> 
> Thanks,
> Sergio.
>
Paolo Bonzini Sept. 25, 2019, 8:26 a.m. UTC | #10
On 25/09/19 10:10, Sergio Lopez wrote:
> That would be great. I'm also looking forward to virtio-mem (and a
> hypothetical virtio-cpu) eventually gaining hotplug capabilities in
> microvm.

I disagree with this.  virtio is not a silver bullet (and in fact
perhaps it's just me but I've never understood the advantages of
virtio-mem over anything else).

If you want to add hotplug to microvm, you can reuse the existing code
for CPU and memory hotplug controllers, and write drivers for them in
Linux's drivers/platform.  The drivers would basically do what the ACPI
AML tells the interpreter to do.

There is no reason to add the complexity of virtio to something as
low-level and deadlock-prone as CPU hotplug.

Paolo
Pankaj Gupta Sept. 25, 2019, 8:37 a.m. UTC | #11
> >>> Microvm is a machine type inspired by both NEMU and Firecracker, and
> >>> constructed after the machine model implemented by the latter.
> >>>
> >>> Its main purpose is providing users a minimalist machine type free
> >>> from the burden of legacy compatibility, serving as a stepping stone
> >>> for future projects aiming at improving boot times, reducing the
> >>> attack surface and slimming down QEMU's footprint.
> >>>
> >>> The microvm machine type supports the following devices:
> >>>
> >>>  - ISA bus
> >>>  - i8259 PIC
> >>>  - LAPIC (implicit if using KVM)
> >>>  - IOAPIC (defaults to kernel_irqchip_split = true)
> >>>  - i8254 PIT
> >>>  - MC146818 RTC (optional)
> >>>  - kvmclock (if using KVM)
> >>>  - fw_cfg
> >>>  - One ISA serial port (optional)
> >>>  - Up to eight virtio-mmio devices (configured by the user)
> >>
> >> So I assume also no ACPI (CPU/memory hotplug), correct?
> > 
> > Correct.
> > 
> >> @Pankaj, I think it would make sense to make virtio-pmem play with
> >> virtio-mmio/microvm.
> > 
> > That would be great. I'm also looking forward to virtio-mem (and a
> > hypothetical virtio-cpu) eventually gaining hotplug capabilities in
> > microvm.
> 
> @Pankaj, do you have time to look into the virtio-pmem thingy? I guess
> the virtio-mmio wrapper shouldn't be too hard (very similar to the
> virtio-pci wrapper - luckily I insisted on making it work independently
> of PCI BARs and ACPI slots ;) ). The microvm bits would be properly
> setting up device memory and wiring up the hotplug handlers, similar to
> what is done in the other PC machine types (maybe that comes for free?).

Yes, I can look at it.

> 
> virtio-pmem will allow placing the rootfs (in read-only mode) on a fake
> NVDIMM, as done e.g. in Kata Containers. We might have to include the
> virtio-pmem kernel module in the initramfs, which shouldn't be too hard.
> Not sure what else we'll need to make virtio-pmem get used as a rootfs.

Sure, will work on it.

Thanks,
Pankaj

> 
> > 
> > Thanks,
> > Sergio.
> > 
> 
> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 
>
Sergio Lopez Pascual Sept. 25, 2019, 8:40 a.m. UTC | #12
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 25/09/19 07:49, Sergio Lopez wrote:
>>>> +serving as a stepping stone
>>>> +for future projects aiming at improving boot times, reducing the
>>>> +attack surface and slimming down QEMU's footprint.
>>>
>>> "Microvm also establishes a baseline for benchmarking QEMU and operating
>>> systems, since it is optimized for both boot time and footprint".
>> 
>> Well, I prefer my paragraph, but I'm good with either.
>
> You're right, my version sort of missed the point.  What about
> s/benchmarking/benchmarking and optimizing/?
>
>>>> +The microvm machine type supports the following devices:
>>>> +
>>>> + - ISA bus
>>>> + - i8259 PIC
>>>> + - LAPIC (implicit if using KVM)
>>>> + - IOAPIC (defaults to kernel_irqchip_split = true)
>>>> + - i8254 PIT
>>>
>>> Do we need the PIT?  And perhaps the PIC even?
>> 
>> We need the PIT for non-KVM accel (if present with KVM and
>> kernel_irqchip_split = off, it basically becomes a placeholder)
>
> Why?

Perhaps I'm missing something. Is some other device supposed to be
acting as a HW timer while running with TCG acceleration?

>> and the
>> PIC for both the PIT and the ISA serial port.
>
> Can't the ISA serial port work with the IOAPIC?

Hm... I'm not sure. I wanted to give it a try, but then noticed that
multiple places in the code (like hw/intc/apic.c:560) do expect to have
an ISA PIC present through the isa_pic global variable.

I guess we should be able to work around this, but I'm not sure if it's
really worth it. What do you think?

Thanks,
Sergio.
Sergio Lopez Pascual Sept. 25, 2019, 8:42 a.m. UTC | #13
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 25/09/19 10:10, Sergio Lopez wrote:
>> That would be great. I'm also looking forward to virtio-mem (and a
>> hypothetical virtio-cpu) eventually gaining hotplug capabilities in
>> microvm.
>
> I disagree with this.  virtio is not a silver bullet (and in fact
> perhaps it's just me but I've never understood the advantages of
> virtio-mem over anything else).
>
> If you want to add hotplug to microvm, you can reuse the existing code
> for CPU and memory hotplug controllers, and write drivers for them in
> Linux's drivers/platform.  The drivers would basically do what the ACPI
> AML tells the interpreter to do.
>
> There is no reason to add the complexity of virtio to something as
> low-level and deadlock-prone as CPU hotplug.

TBH, I haven't put much thought into this yet. I'll keep this in mind
for the future.

Thanks,
Sergio.
David Hildenbrand Sept. 25, 2019, 8:44 a.m. UTC | #14
On 25.09.19 10:26, Paolo Bonzini wrote:
> On 25/09/19 10:10, Sergio Lopez wrote:
>> That would be great. I'm also looking forward to virtio-mem (and a
>> hypothetical virtio-cpu) eventually gaining hotplug capabilities in
>> microvm.
> 
> I disagree with this.  virtio is not a silver bullet (and in fact
> perhaps it's just me but I've never understood the advantages of
> virtio-mem over anything else).

Sorry, I had to lol about "virtio-mem over anything else". No, not
starting a discussion.

> 
> If you want to add hotplug to microvm, you can reuse the existing code
> for CPU and memory hotplug controllers, and write drivers for them in
> Linux's drivers/platform.  The drivers would basically do what the ACPI
> AML tells the interpreter to do.
> 
> There is no reason to add the complexity of virtio to something as
> low-level and deadlock-prone as CPU hotplug.

I do agree in respect of CPU hotplug complexity (especially across
architectures), but thinking "outside of the wonderful x86 world", other
architectures impose limitations (e.g., no cpu unplug on s390x - at
least for now) that make something like this very interesting. But yeah,
I already expressed somewhere else my feelings about CPU hotplug.

I consider virtio the silver bullet whenever we want a mature
paravirtualized interface across architectures. And you can tell that
I'm not the only one by the huge number of virtio devices people are
crafting right now.
Gerd Hoffmann Sept. 25, 2019, 9:12 a.m. UTC | #15
Hi,

> If you want to add hotplug to microvm, you can reuse the existing code
> for CPU and memory hotplug controllers, and write drivers for them in
> Linux's drivers/platform.  The drivers would basically do what the ACPI
> AML tells the interpreter to do.

How would the linux kernel detect those devices?

I guess that wouldn't be ACPI, seems everyone wants to avoid it[1].

So device tree on x86?  Something else?

cheers,
  Gerd

[1] Not clear to me why, some minimal ACPI tables listing our
    devices (isa-serial, fw_cfg, ...) don't look unreasonable
    to me.  We could also make virtio-mmio discoverable that way.
    Also we could do acpi cpu hotplug without having to write those
    linux platform drivers.  We would need a sysbus-acpi device though,
    but given that most acpi code is already separated out so piix and
    q35 can share it, it should not be that hard to wire up.
Paolo Bonzini Sept. 25, 2019, 9:22 a.m. UTC | #16
On 25/09/19 10:40, Sergio Lopez wrote:
>>> We need the PIT for non-KVM accel (if present with KVM and
>>> kernel_irqchip_split = off, it basically becomes a placeholder)
>> Why?
> 
> Perhaps I'm missing something. Is some other device supposed to be
> acting as a HW timer while running with TCG acceleration?

Sure, the LAPIC timer.  I wonder if Linux, however, wants to use the PIT
in order to calibrate the LAPIC timer if TSC deadline mode is unavailable.
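
For illustration only: a calibration of this kind amounts to sampling the
PIT and the LAPIC timer across the same interval and scaling by the PIT's
fixed 1.193182 MHz input clock. The sketch below shows just that
arithmetic. The helper name is hypothetical, and it is not a description
of how Linux actually structures its calibration code.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_HZ 1193182ULL  /* i8254 input clock frequency in Hz */

/*
 * Hypothetical helper: given the number of LAPIC timer ticks and PIT ticks
 * that elapsed over the same interval, estimate the LAPIC timer frequency
 * in Hz. The division truncates towards zero.
 */
static uint64_t lapic_timer_hz(uint64_t lapic_ticks, uint64_t pit_ticks)
{
    assert(pit_ticks > 0);
    return lapic_ticks * PIT_HZ / pit_ticks;
}
```

For example, if 1193182 PIT ticks (one second) elapse while the LAPIC
timer counts 1000000000 ticks, the estimate is 1 GHz. Without a PIT (or
another reference clock) there is nothing to scale against, which is the
concern raised above.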

>>> and the PIC for both the PIT and the ISA serial port.
>>
>> Can't the ISA serial port work with the IOAPIC?
> 
> Hm... I'm not sure. I wanted to give it a try, but then noticed that
> multiple places in the code (like hw/intc/apic.c:560) do expect to have
> an ISA PIC present through the isa_pic global variable.
> 
> I guess we should be able to work around this, but I'm not sure if it's
> really worth it. What do you think?

You can add a paragraph saying that in the future the list could be
reduced further.  I think that the direction we want to go is to only
leave the IOAPIC around (the ISA devices in this respect are no
different from the virtio-mmio devices).

But you're right about isa_pic.  I wonder if it's as easy as this:

diff --git a/hw/intc/apic.c b/hw/intc/apic.c
index bce89911dc..5d03e48a19 100644
--- a/hw/intc/apic.c
+++ b/hw/intc/apic.c
@@ -610,7 +610,7 @@ int apic_accept_pic_intr(DeviceState *dev)

     if ((s->apicbase & MSR_IA32_APICBASE_ENABLE) == 0 ||
         (lvt0 & APIC_LVT_MASKED) == 0)
-        return 1;
+        return isa_pic != NULL;

     return 0;
 }

Thanks,

Paolo
Paolo Bonzini Sept. 25, 2019, 9:29 a.m. UTC | #17
On 25/09/19 11:12, Gerd Hoffmann wrote:
>   Hi,
> 
>> If you want to add hotplug to microvm, you can reuse the existing code
>> for CPU and memory hotplug controllers, and write drivers for them in
>> Linux's drivers/platform.  The drivers would basically do what the ACPI
>> AML tells the interpreter to do.
> 
> How would the linux kernel detect those devices?
>
> I guess that wouldn't be ACPI, seems everyone wants to avoid it[1].
> 
> So device tree on x86?  Something else?

Yes, device tree would be great.

> [1] Not clear to me why, some minimal ACPI tables listing our
>     devices (isa-serial, fw_cfg, ...) don't look unreasonable
>     to me.

It's not, but ACPI is dog slow and half of the boot time is cut if you
remove it.

> We could also make virtio-mmio discoverable that way.

True, but the simplest way to plumb virtio-mmio into ACPI would be
taking the device tree properties and representing them as _DSD[1].  So
at this point it's just as easy to use the device tree directly.

Paolo

[1]
https://kernel-recipes.org/en/2015/talks/representing-device-tree-peripherals-in-acpi/
David Hildenbrand Sept. 25, 2019, 9:47 a.m. UTC | #18
On 25.09.19 11:12, Gerd Hoffmann wrote:
>   Hi,
> 
>> If you want to add hotplug to microvm, you can reuse the existing code
>> for CPU and memory hotplug controllers, and write drivers for them in
>> Linux's drivers/platform.  The drivers would basically do what the ACPI
>> AML tells the interpreter to do.
> 
> How would the linux kernel detect those devices?
> 
> I guess that wouldn't be ACPI, seems everyone wants to avoid it[1].
> 
> So device tree on x86?  Something else?
> 
> cheers,
>   Gerd
> 
> [1] Not clear to me why, some minimal ACPI tables listing our
>     devices (isa-serial, fw_cfg, ...) don't look unreasonable
>     to me.  We could also make virtio-mmio discoverable that way.
>     Also we could do acpi cpu hotplug without having to write those
>     linux platform drivers.  We would need a sysbus-acpi device though,
>     but given that most acpi code is already separated out so piix and
>     q35 can share it, it should not be that hard to wire up.
> 

Just to make one thing clear, the same could be used for DIMM based
memory hotplug, too. virtio-mem is not simply exposing DIMMs to a guest
using virtio. It's even designed to co-exist with DIMM based memory hotplug.
Paolo Bonzini Sept. 25, 2019, 10:19 a.m. UTC | #19
This is a tangent, but I was a bit too harsh in my previous message (at
least it made you laugh rather than angry!) so I think I owe you an
explanation.

On 25/09/19 10:44, David Hildenbrand wrote:
> I consider virtio the silver bullet whenever we want a mature
> paravirtualized interface across architectures. And you can tell that
> I'm not the only one by the huge amount of virtio device people are
> crafting right now.

Given there are hardware implementations of virtio, I would refine that:
virtio is a silver bullet whenever we want a mature ring buffer
interface across architectures.  Being friendly to virtualization is by
now only a detail of virtio.  It is also not exclusive to virtio, for
example NVMe 1.3 has incorporated some ideas from Xen and virtio and is
also virtualization-friendly.

In turn, the ring buffer interface is great if you want to have mostly
asynchronous operation---if not, the ring buffer is just adding
complexity.  Sure, we have the luxury of abstractions and powerful
computers that hide most of the complexity, but some of it still lurks
in the form of race conditions.

So the question for virtio-mem is what makes asynchronous operation
important for memory hotplug?  If I understand the virtio-mem driver,
all interaction with the virtio device happens through a work item,
meaning that it is strictly synchronous.  At this point, you do not need
a ring buffer, you only need:

- a command register where you write the address of a command buffer.
The device will do DMA from the command block, do whatever it has to do,
DMA back the results, and trigger an interrupt.

- an interrupt mechanism.  It could be MSI, or it could be an interrupt
pending/interrupt acknowledge register if all the hardware offers is
level-triggered interrupts.

I do agree that virtio-mem's command buffer/DMA architecture is better
than the more traditional "bunch of hardware registers" architecture
that QEMU uses for its ACPI-based CPU and memory hotplug controllers.
But that's because command buffer/DMA is what actually defines a good
paravirtualized interface; virtio is a superset of that, and it may not
always be a good solution.
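
As an illustration only (not code from this series or from virtio-mem), the
two pieces listed above, a command register plus an interrupt
pending/acknowledge register, could be modelled roughly as follows. The
struct layout, opcodes, status codes and all function names are hypothetical
and chosen just for this sketch; "DMA" is simulated with memcpy on a byte
array standing in for guest memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command block; the layout is made up for this sketch. */
struct cmd_block {
    uint32_t opcode;  /* CMD_PLUG or CMD_UNPLUG */
    uint64_t arg;     /* number of blocks to plug or unplug */
    int32_t  status;  /* written back by the device */
};

enum { CMD_PLUG = 1, CMD_UNPLUG = 2 };
enum { ST_OK = 0, ST_NACK = -1 };

struct cmd_dev {
    uint8_t *ram;         /* stand-in for guest physical memory */
    size_t   ram_size;
    uint64_t plugged;     /* blocks currently plugged */
    uint64_t limit;       /* most blocks the host will plug */
    int      irq_pending; /* "interrupt pending" register */
};

/* Command register write: the value is the address of a command block. */
static void cmd_reg_write(struct cmd_dev *d, uint64_t gpa)
{
    struct cmd_block cb;

    assert(d->ram_size >= sizeof(cb) && gpa <= d->ram_size - sizeof(cb));
    memcpy(&cb, d->ram + gpa, sizeof(cb));   /* "DMA" the command in */

    cb.status = ST_NACK;
    if (cb.opcode == CMD_PLUG && cb.arg <= d->limit - d->plugged) {
        d->plugged += cb.arg;
        cb.status = ST_OK;
    } else if (cb.opcode == CMD_UNPLUG && cb.arg <= d->plugged) {
        d->plugged -= cb.arg;
        cb.status = ST_OK;
    }

    memcpy(d->ram + gpa, &cb, sizeof(cb));   /* "DMA" the result back */
    d->irq_pending = 1;                      /* signal completion */
}

/* Interrupt acknowledge register: returns and clears the pending bit. */
static int irq_ack(struct cmd_dev *d)
{
    int was_pending = d->irq_pending;

    d->irq_pending = 0;
    return was_pending;
}
```

The guest side would place a cmd_block in its memory, write the block's
address to the command register, and read the status back once the interrupt
arrives; no ring is involved because only one command is ever in flight.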

Paolo
David Hildenbrand Sept. 25, 2019, 10:50 a.m. UTC | #20
On 25.09.19 12:19, Paolo Bonzini wrote:
> This is a tangent, but I was a bit too harsh in my previous message (at
> least it made you laugh rather than angry!) so I think I owe you an
> explanation.

It's hard to make me really angry, you have to try harder :) However,
after years of working on VMs, VM memory management and Linux MM, I
have learned that things are horribly complicated. That is not obvious,
so I can't expect everyone to know what I learned.

> 
> On 25/09/19 10:44, David Hildenbrand wrote:
>> I consider virtio the silver bullet whenever we want a mature
>> paravirtualized interface across architectures. And you can tell that
>> I'm not the only one by the huge number of virtio devices people are
>> crafting right now.
> 
> Given there are hardware implementations of virtio, I would refine that:
> virtio is a silver bullet whenever we want a mature ring buffer
> interface across architectures.  Being friendly to virtualization is by
> now only a detail of virtio.  It is also not exclusive to virtio, for
> example NVMe 1.3 has incorporated some ideas from Xen and virtio and is
> also virtualization-friendly.
> 
> In turn, the ring buffer interface is great if you want to have mostly
> asynchronous operation---if not, the ring buffer is just adding
> complexity.  Sure, we have the luxury of abstractions and powerful
> computers that hide most of the complexity, but some of it still lurks
> in the form of race conditions.
> 
> So the question for virtio-mem is what makes asynchronous operation
> important for memory hotplug?  If I understand the virtio-mem driver,
> all interaction with the virtio device happens through a work item,
> meaning that it is strictly synchronous.  At this point, you do not need
> a ring buffer, you only need:

So, the main building pieces virtio-mem uses as of now in the virtio
infrastructure are the config space and one virtqueue.

a) A way for the host to send requests to the guest. E.g., request a
certain amount of memory to be plugged/unplugged by the guest. Done via
config space updates (e.g., similar to virtio-balloon
inflation/deflation requests).
b) A way for the guest to communicate with the host. E.g., send
plug/unplug requests to plug/unplug separate memory blocks. Done via a
virtqueue. Similar to inflation/deflation of pages in virtio-balloon.

Requests by the host via the config space are processed asynchronously
by the guest (again, similar to - say - virtio-balloon). Guest requests
are currently processed synchronously by the host.

Guest: Can I plug this block?
Host: Sorry, No can do.
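
A rough model of the flow described above, offered only as a sketch: the
host publishes a target in "config space", and the guest, from a work item,
sends one plug or unplug request at a time and waits for the synchronous
answer. All names here (vmem_host, host_limit, guest_work_item, and so on)
are hypothetical and not taken from the virtio-mem driver or device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vmem_host {
    uint64_t requested_blocks; /* "config space": target set by the host */
    uint64_t plugged_blocks;   /* blocks the host has granted so far */
    uint64_t host_limit;       /* hypothetical: what the host can back */
};

/* Guest->host plug request ("virtqueue"), answered synchronously. */
static bool host_handle_plug(struct vmem_host *h)
{
    if (h->plugged_blocks >= h->host_limit) {
        return false;          /* "Sorry, no can do." */
    }
    h->plugged_blocks++;
    return true;
}

/* Guest->host unplug request, also answered synchronously. */
static bool host_handle_unplug(struct vmem_host *h)
{
    if (h->plugged_blocks == 0) {
        return false;
    }
    h->plugged_blocks--;
    return true;
}

/*
 * Guest work item, run after a config space change notification.
 * Returns the number of requests sent to the host.
 */
static unsigned guest_work_item(struct vmem_host *h)
{
    unsigned sent = 0;

    assert(h);
    while (h->plugged_blocks < h->requested_blocks) {
        sent++;
        if (!host_handle_plug(h)) {
            break;             /* host refused, give up for now */
        }
    }
    while (h->plugged_blocks > h->requested_blocks) {
        sent++;
        if (!host_handle_unplug(h)) {
            break;
        }
    }
    return sent;
}
```

With a target of 4 blocks and a host limit of 3, the guest sends four
requests: three are granted and the fourth is refused.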

Can't tell if there might be extensions (if virtio-mem ever comes to
life ;) ) that might make use of asynchronous communication. Especially,
there might be asynchronous/multiple guest->host requests at some point
(e.g., "I'm nearly out of memory, please send help").

So yes, currently we could live without the ring buffer. But the config
space and the virtqueue are real life-savers for me right now :)

> 
> - a command register where you write the address of a command buffer.
> The device will do DMA from the command block, do whatever it has to do,
> DMA back the results, and trigger an interrupt.
> 
> - an interrupt mechanism.  It could be MSI, or it could be an interrupt
> pending/interrupt acknowledge register if all the hardware offers is
> level-triggered interrupts.
> 
> I do agree that virtio-mem's command buffer/DMA architecture is better
> than the more traditional "bunch of hardware registers" architecture
> that QEMU uses for its ACPI-based CPU and memory hotplug controllers.
> But that's because command buffer/DMA is what actually defines a good
> paravirtualized interface; virtio is a superset of that, and it may not
> always be a good solution.
> 

I completely agree with what you say here: virtio comes with complexity,
but also with features (e.g., config space, support for multiple queues,
abstraction of transports).

If I only wanted to expose a DIMM to the guest, just as is done via ACPI,
virtio would clearly not be the right choice.
Sergio Lopez Pascual Sept. 25, 2019, 11:04 a.m. UTC | #21
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 25/09/19 10:40, Sergio Lopez wrote:
>>>> We need the PIT for non-KVM accel (if present with KVM and
>>>> kernel_irqchip_split = off, it basically becomes a placeholder)
>>> Why?
>> 
>> Perhaps I'm missing something. Is some other device supposed to be
>> acting as a HW timer while running with TCG acceleration?
>
> Sure, the LAPIC timer.  I wonder if Linux, however, wants to use the PIT
> in order to calibrate the LAPIC timer if TSC deadline mode is unavailable.

Ah, yes. I was so confused by the nomenclature that I assumed we didn't
have a userspace implementation of it.

On the other hand, as you suspect, without the PIT Linux does hang in
TSC calibration with TCG accel.

A simple option could be adding it only if we're running without KVM.

>>>> and the PIC for both the PIT and the ISA serial port.
>>>
>>> Can't the ISA serial port work with the IOAPIC?
>> 
>> Hm... I'm not sure. I wanted to give it a try, but then noticed that
>> multiple places in the code (like hw/intc/apic.c:560) do expect to have
>> an ISA PIC present through the isa_pic global variable.
>> 
>> I guess we should be able to work around this, but I'm not sure if it's
>> really worth it. What do you think?
>
> You can add a paragraph saying that in the future the list could be
> reduced further.  I think that the direction we want to go is to only
> leave the IOAPIC around (the ISA devices in this respect are no
> different from the virtio-mmio devices).
>
> But you're right about isa_pic.  I wonder if it's as easy as this:
>
> diff --git a/hw/intc/apic.c b/hw/intc/apic.c
> index bce89911dc..5d03e48a19 100644
> --- a/hw/intc/apic.c
> +++ b/hw/intc/apic.c
> @@ -610,7 +610,7 @@ int apic_accept_pic_intr(DeviceState *dev)
>
>      if ((s->apicbase & MSR_IA32_APICBASE_ENABLE) == 0 ||
>          (lvt0 & APIC_LVT_MASKED) == 0)
> -        return 1;
> +        return isa_pic != NULL;
>
>      return 0;
>  }

Yes, that would do the trick. There's another use of it at
hw/intc/ioapic.c:78, but we should be safe as, at least in the case of
Linux, DM_EXTINT is only used in check_timer(), which is only called if
it detects an i8259 PIC.

We should probably add an assertion with an informative message, just in
case.
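
For reference, the patched check can be modelled in isolation like this.
It is a simplified sketch, not the QEMU function: plain integers stand in
for the APIC state, a plain pointer stands in for the isa_pic global, and
the two macros are local names for what I understand to be the
architectural bit positions (bit 11 of IA32_APIC_BASE, bit 16 of an LVT
entry).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define APICBASE_ENABLE (1u << 11)  /* APIC global enable bit */
#define LVT_MASKED      (1u << 16)  /* mask bit of an LVT entry */

/*
 * Model of the patched check: a PIC interrupt is only accepted when a
 * PIC actually exists, and either the APIC is disabled or LINT0 is
 * unmasked.
 */
static int accept_pic_intr(uint64_t apicbase, uint32_t lvt0,
                           const void *isa_pic)
{
    if ((apicbase & APICBASE_ENABLE) == 0 || (lvt0 & LVT_MASKED) == 0) {
        return isa_pic != NULL;
    }
    return 0;
}
```

The only behavioural change against the unpatched logic is the case where
no PIC is present, which now always yields 0.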

Thanks,
Sergio.
Paolo Bonzini Sept. 25, 2019, 11:20 a.m. UTC | #22
On 25/09/19 13:04, Sergio Lopez wrote:
> Yes, that would do the trick. There's another use of it at
> hw/intc/ioapic.c:78, but we should be safe as, at least in the case of
> Linux, DM_EXTINT is only used in check_timer(), which is only called if
> it detects an i8259 PIC.

Even there it is actually LVT0's DM_EXTINT, not the IOAPIC's.  I think
pic_read_irq would have returned 7 (spurious IRQ on master i8259) until
commit 29bb5317cb ("i8259: QOM cleanups", 2013-04-29), so we should fix it.

Paolo

> We should probably add an assertion with an informative message, just in
> case.
Paolo Bonzini Sept. 25, 2019, 11:24 a.m. UTC | #23
On 25/09/19 12:50, David Hildenbrand wrote:
> Can't tell if there might be extensions (if virtio-mem ever comes to
> life ;) ) that might make use of asynchronous communication. Especially,
> there might be asynchronous/multiple guest->host requests at some point
> (e.g., "I'm nearly out of memory, please send help").

Okay, this makes sense.  I'm almost sold on it. :)

Config space also makes sense, though what you really need is the config
space interrupt, rather than config space per se.

Paolo

> So yes, currently we could live without the ring buffer. But the config
> space and the virtqueue are real life-savers for me right now :)
David Hildenbrand Sept. 25, 2019, 11:32 a.m. UTC | #24
On 25.09.19 13:24, Paolo Bonzini wrote:
> On 25/09/19 12:50, David Hildenbrand wrote:
>> Can't tell if there might be extensions (if virtio-mem ever comes to
>> life ;) ) that might make use of asynchronous communication. Especially,
>> there might be asynchronous/multiple guest->host requests at some point
>> (e.g., "I'm nearly out of memory, please send help").
> 
> Okay, this makes sense.  I'm almost sold on it. :)
> 
> Config space also makes sense, though what you really need is the config
> space interrupt, rather than config space per se.
> 

Right, and feature negotiation is yet another nice-to-have thingy in the
virtio world :)

> Paolo
Philippe Mathieu-Daudé Sept. 25, 2019, 11:33 a.m. UTC | #25
On 9/25/19 7:51 AM, Sergio Lopez wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
> 
>> On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
>>>
>>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>>> constructed after the machine model implemented by the latter.
>>>
>>> Its main purpose is providing users a minimalist machine type free
>>> from the burden of legacy compatibility, serving as a stepping stone
>>> for future projects aiming at improving boot times, reducing the
>>> attack surface and slimming down QEMU's footprint.
>>
>>
>>>  docs/microvm.txt                 |  78 +++
>>
>> I'm not sure how close to acceptance this patchset is at the
>> moment, so not necessarily something you need to do now,
>> but could new documentation in docs/ be in rst format, not
>> plain text, please? (Ideally also they should be in the right
>> manual subdirectory, but documentation of system emulation
>> machines at the moment is still in texinfo format, so we
>> don't have a subdir for it yet.)
> 
> Sure. What I didn't get is, should I put it in "docs/microvm.rst" or in
> some other subdirectory?

Should we introduce docs/machines/?
Peter Maydell Sept. 25, 2019, 12:39 p.m. UTC | #26
On Wed, 25 Sep 2019 at 12:33, Philippe Mathieu-Daudé <philmd@redhat.com> wrote:
>
> On 9/25/19 7:51 AM, Sergio Lopez wrote:
> > Peter Maydell <peter.maydell@linaro.org> writes:
> >
> >> On Tue, 24 Sep 2019 at 14:25, Sergio Lopez <slp@redhat.com> wrote:
> >>>
> >>> Microvm is a machine type inspired by both NEMU and Firecracker, and
> >>> constructed after the machine model implemented by the latter.
> >>>
> >>> Its main purpose is providing users a minimalist machine type free
> >>> from the burden of legacy compatibility, serving as a stepping stone
> >>> for future projects aiming at improving boot times, reducing the
> >>> attack surface and slimming down QEMU's footprint.
> >>
> >>
> >>>  docs/microvm.txt                 |  78 +++
> >>
> >> I'm not sure how close to acceptance this patchset is at the
> >> moment, so not necessarily something you need to do now,
> >> but could new documentation in docs/ be in rst format, not
> >> plain text, please? (Ideally also they should be in the right
> >> manual subdirectory, but documentation of system emulation
> >> machines at the moment is still in texinfo format, so we
> >> don't have a subdir for it yet.)
> >
> > Sure. What I didn't get is, should I put it in "docs/microvm.rst" or in
> > some other subdirectory?
>
> Should we introduce docs/machines/?

This should live in the not-yet-created docs/system (the "system emulation
user's guide"), along with much of the content currently still in
the texinfo docs. But we don't have that structure yet and won't
until we do the texinfo conversion, so I think for the moment we
have two reasonable choices:
 (1) put it in the texinfo, so it is at least shipped to
     users until we get around to doing our docs conversion
 (2) leave it in docs/microvm.rst for now (we have a bunch
     of other docs in docs/ which are basically there because
     they're also awaiting the texinfo conversion and creation
     of the docs/user and docs/system manuals)

My ideal vision of how to do documentation of individual
machines, incidentally, would be to do it via doc comments
or some other kind of structured markup in the .c files
that define the machine, so that we could automatically
collect up the docs for the machines we're building,
put them into per-architecture sections of the docs,
have autogenerated stub "this machine exists but isn't
documented yet" entries, etc. But that's not something that
we could easily do today so I don't want to block interim
improvements to our documentation just because I have some
nice theoretical idea for how it ought to work :-)

thanks
-- PMM
Sergio Lopez Pascual Sept. 25, 2019, 3:04 p.m. UTC | #27
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 24/09/19 14:44, Sergio Lopez wrote:
>> +Microvm is a machine type inspired by both NEMU and Firecracker, and
>> +constructed after the machine model implemented by the latter.
>
> I would say it's inspired by Firecracker only.  The NEMU virt machine
> had virtio-pci and ACPI.
>
>> +Its main purpose is providing users a minimalist machine type free
>> +from the burden of legacy compatibility,
>
> I think this is too strong, especially if you keep the PIC and PIT. :)
> Maybe just "It's a minimalist machine type without PCI support designed
> for short-lived guests".
>
>> +serving as a stepping stone
>> +for future projects aiming at improving boot times, reducing the
>> +attack surface and slimming down QEMU's footprint.
>
> "Microvm also establishes a baseline for benchmarking QEMU and operating
> systems, since it is optimized for both boot time and footprint".
>
>> +The microvm machine type supports the following devices:
>> +
>> + - ISA bus
>> + - i8259 PIC
>> + - LAPIC (implicit if using KVM)
>> + - IOAPIC (defaults to kernel_irqchip_split = true)
>> + - i8254 PIT
>
> Do we need the PIT?  And perhaps the PIC even?
>

I'm going back to this level of the thread, because after your
suggestion I took a deeper look at how things work around the PIC, and
discovered I was completely wrong about my assumptions.

For virtio-mmio devices, given that we don't have the ability to
configure vectors (as it's done in the PCI case) we're stuck with the
ones provided by the platform PIC, which in the x86 case is the i8259
(at least from Linux's perspective).

So we can get rid of the IOAPIC, but we need to keep the i8259 (we have
both a userspace and a kernel implementation too, so it should be fine).

As for the PIT, we can omit it if we're running with KVM acceleration,
as kvmclock will be used to calculate loops per jiffy and avoid the
calibration, leaving it enabled otherwise.

Thanks,
Sergio.
Paolo Bonzini Sept. 25, 2019, 4:46 p.m. UTC | #28
On 25/09/19 17:04, Sergio Lopez wrote:
> I'm going back to this level of the thread, because after your
> suggestion I took a deeper look at how things work around the PIC, and
> discovered I was completely wrong about my assumptions.
> 
> For virtio-mmio devices, given that we don't have the ability to
> configure vectors (as it's done in the PCI case) we're stuck with the
> ones provided by the platform PIC, which in the x86 case is the i8259
> (at least from Linux's perspective).
> 
> So we can get rid of the IOAPIC, but we need to keep the i8259 (we have
> both a userspace and a kernel implementation too, so it should be fine).

Hmm...  I would have thought the vectors are just GSIs, which will be
configured to the IOAPIC if it is present.  Maybe something is causing
Linux to ignore the IOAPIC?

> As for the PIT, we can omit it if we're running with KVM acceleration,
> as kvmclock will be used to calculate loops per jiffy and avoid the
> calibration, leaving it enabled otherwise.

Can you make it an OnOffAuto property, and default to on iff !KVM?
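
A minimal sketch of what "OnOffAuto, default to on iff !KVM" means for the
PIT. The enum and function below are local stand-ins written for this
example, not QEMU's own types or property code, and kvm_enabled is passed
in as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for a tri-state machine property (illustrative names). */
enum on_off_auto { OOA_AUTO, OOA_ON, OOA_OFF };

/* Decide whether the PIT should be created for a given property value. */
static bool pit_enabled(enum on_off_auto prop, bool kvm_enabled)
{
    switch (prop) {
    case OOA_ON:
        return true;
    case OOA_OFF:
        return false;
    case OOA_AUTO:
        return !kvm_enabled;   /* auto: only needed without KVM */
    }
    assert(!"invalid on_off_auto value");
    return true;
}
```

An explicit on or off always wins; only the auto setting looks at the
accelerator.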

Paolo
Sergio Lopez Pascual Sept. 26, 2019, 6:23 a.m. UTC | #29
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 25/09/19 17:04, Sergio Lopez wrote:
>> I'm going back to this level of the thread, because after your
>> suggestion I took a deeper look at how things work around the PIC, and
>> discovered I was completely wrong about my assumptions.
>> 
>> For virtio-mmio devices, given that we don't have the ability to
>> configure vectors (as it's done in the PCI case) we're stuck with the
>> ones provided by the platform PIC, which in the x86 case is the i8259
>> (at least from Linux's perspective).
>> 
>> So we can get rid of the IOAPIC, but we need to keep the i8259 (we have
>> both a userspace and a kernel implementation too, so it should be fine).
>
> Hmm...  I would have thought the vectors are just GSIs, which will be
> configured to the IOAPIC if it is present.  Maybe something is causing
> Linux to ignore the IOAPIC?

Turns out it was a bug in microvm. I was writing 0 to FW_CFG_NB_CPUS
(because I was using x86ms->boot_cpus instead of ms->smp.cpus), which
led to a broken MP table, causing Linux to ignore it and, as a side
effect, to disable IOAPIC symmetric I/O mode.

After fixing it we can, indeed, boot without the i8259 \o/ :

/ # dmesg | grep legacy
[    0.074144] Using NULL legacy PIC
/ # cat /pr[   12.116930] random: fast init done
/ # cat /proc/interrupts 
           CPU0       CPU1       
  4:          0        278   IO-APIC   4-edge      ttyS0
 12:         48          0   IO-APIC  12-edge      virtio0
NMI:          0          0   Non-maskable interrupts
LOC:        124         98   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:        476        535   Rescheduling interrupts
CAL:          0         76   Function call interrupts
TLB:          0          0   TLB shootdowns
HYP:          0          0   Hypervisor callback interrupts
ERR:          0
MIS:          0
PIN:          0          0   Posted-interrupt notification event
NPI:          0          0   Nested posted-interrupt event
PIW:          0          0   Posted-interrupt wakeup event

There's still one problem. If the Guest doesn't have TSC_DEADLINE_TIME,
Linux hangs on APIC timer calibration. I'm looking for a way to work
around this. Worst case scenario, we can check for that feature and add
both PIC and PIT if it is missing.

>> As for the PIT, we can omit it if we're running with KVM acceleration,
>> as kvmclock will be used to calculate loops per jiffy and avoid the
>> calibration, leaving it enabled otherwise.
>
> Can you make it an OnOffAuto property, and default to on iff !KVM?

Sure.

Thanks,
Sergio.
Christian Borntraeger Sept. 26, 2019, 7:48 a.m. UTC | #30
On 24.09.19 14:44, Sergio Lopez wrote:
> Microvm is a machine type inspired by both NEMU and Firecracker, and
> constructed after the machine model implemented by the latter.
> 
> Its main purpose is providing users a minimalist machine type free
> from the burden of legacy compatibility, serving as a stepping stone
> for future projects aiming at improving boot times, reducing the
> attack surface and slimming down QEMU's footprint.
> 
> The microvm machine type supports the following devices:
> 
>  - ISA bus
>  - i8259 PIC
>  - LAPIC (implicit if using KVM)
>  - IOAPIC (defaults to kernel_irqchip_split = true)
>  - i8254 PIT
>  - MC146818 RTC (optional)
>  - kvmclock (if using KVM)
>  - fw_cfg
>  - One ISA serial port (optional)
>  - Up to eight virtio-mmio devices (configured by the user)

Just out of curiosity. 
What is the reason for not going virtio-pci? Is the PCI bus really
that expensive and complicated?
FWIW, I am not complaining. When people start using virtio-mmio more
often this would also help virtio-ccw (which I am interested in)
as this forces people to think beyond virtio-pci.
Sergio Lopez Pascual Sept. 26, 2019, 8:22 a.m. UTC | #31
Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 24.09.19 14:44, Sergio Lopez wrote:
>> Microvm is a machine type inspired by both NEMU and Firecracker, and
>> constructed after the machine model implemented by the latter.
>> 
>> Its main purpose is providing users a minimalist machine type free
>> from the burden of legacy compatibility, serving as a stepping stone
>> for future projects aiming at improving boot times, reducing the
>> attack surface and slimming down QEMU's footprint.
>> 
>> The microvm machine type supports the following devices:
>> 
>>  - ISA bus
>>  - i8259 PIC
>>  - LAPIC (implicit if using KVM)
>>  - IOAPIC (defaults to kernel_irqchip_split = true)
>>  - i8254 PIT
>>  - MC146818 RTC (optional)
>>  - kvmclock (if using KVM)
>>  - fw_cfg
>>  - One ISA serial port (optional)
>>  - Up to eight virtio-mmio devices (configured by the user)
>
> Just out of curiosity. 
> What is the reason for not going virtio-pci? Is the PCI bus really
> that expensive and complicated?

Well, expensive is a relative term. PCI does indeed require a
significant amount of code and cycles, but that's for a good reason, as
it provides extensive bus logic allowing things like vector
configuration, hot-plug, chaining, etc...

On the other hand, MMIO lacks any kind of bus logic, as it basically
works by saying "hey, take a look at this address, there may be
something there" to the kernel, so of course it is cheaper. This makes it
ideal for microvm's aim of supporting a VM with the smallest amount of
code, but bad for almost everything else.
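
To make the "take a look at this address" part concrete: Linux accepts a
virtio_mmio.device=<size>@<baseaddr>:<irq> kernel parameter for exactly
this purpose. The helper below is only a sketch of how such a string could
be formatted; the function name is made up, and the size, base address and
IRQ used with it are placeholder values, not what microvm actually emits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one virtio-mmio device description for the kernel command line.
 * Returns what snprintf returns: the length the full string needs.
 */
static int virtio_mmio_cmdline(char *buf, size_t len, uint64_t size,
                               uint64_t base, unsigned irq)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "virtio_mmio.device=%" PRIu64 "@0x%" PRIx64 ":%u",
                    size, base, irq);
}
```

For example, a 512-byte region at the placeholder address 0xc0000000 using
IRQ 12 would be described as virtio_mmio.device=512@0xc0000000:12.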

I don't think this means PCI is expensive. That would be the case if
there were a bus providing similar functionality while requiring less
code and fewer cycles, and MMIO is definitely not such a bus.

In other words, I think PCI's cost is justified by its use case, while
MMIO's simplicity makes it ideal for some specific purposes (like
microvm).

Cheers,
Sergio.

> FWIW, I am not complaining. When people start using virtio-mmio more
> often this would also help virtio-ccw (which I am interested in)
> as this forces people to think beyond virtio-pci.
Paolo Bonzini Sept. 26, 2019, 8:58 a.m. UTC | #32
On 26/09/19 08:23, Sergio Lopez wrote:
> 
> There's still one problem. If the Guest doesn't have TSC_DEADLINE_TIME,
> Linux hangs on APIC timer calibration. I'm looking for a way to work
> around this. Worst case scenario, we can check for that feature and add
> both PIC and PIT if it is missing.
> 

Huh, that's a silly thing that Linux is doing!  If KVM is in use, the
LAPIC timer frequency is known to be 1 GHz.

arch/x86/kernel/kvm.c can just set

	lapic_timer_period = 1000000000 / HZ;

and that should disable LAPIC calibration if TSC deadline is absent.
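
As a worked example of that arithmetic (a sketch, not kernel code): with
the 1 GHz timer frequency stated above, the period is simply the number of
timer ticks per jiffy. The macro and function names are invented for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 1 GHz, the LAPIC timer frequency stated above for KVM guests. */
#define KVM_LAPIC_TIMER_HZ 1000000000u

/* Timer ticks per jiffy for a kernel built with the given HZ value. */
static uint32_t lapic_period_per_jiffy(unsigned hz)
{
    assert(hz > 0);
    return KVM_LAPIC_TIMER_HZ / hz;
}
```

For HZ=250 this gives 4000000 ticks per jiffy, and for HZ=1000 it gives
1000000.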

Paolo
Sergio Lopez Pascual Sept. 26, 2019, 10:16 a.m. UTC | #33
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 26/09/19 08:23, Sergio Lopez wrote:
>> 
>> There's still one problem. If the Guest doesn't have TSC_DEADLINE_TIME,
>> Linux hangs on APIC timer calibration. I'm looking for a way to work
>> around this. Worst case scenario, we can check for that feature and add
>> both PIC and PIT if it is missing.
>> 
>
> Huh, that's a silly thing that Linux is doing!  If KVM is in use, the
> LAPIC timer frequency is known to be 1 GHz.
>
> arch/x86/kernel/kvm.c can just set
>
> 	lapic_timer_period = 1000000000 / HZ;
>
> and that should disable LAPIC calibration if TSC deadline is absent.

Given that they can only be omitted when a specific set of conditions
is met, I think I'm going to make them optional but enabled by default.

I'll also point to this in the documentation.

Thanks,
Sergio
Paolo Bonzini Sept. 26, 2019, 10:21 a.m. UTC | #34
On 26/09/19 12:16, Sergio Lopez wrote:
>> If KVM is in use, the
>> LAPIC timer frequency is known to be 1 GHz.
>>
>> arch/x86/kernel/kvm.c can just set
>>
>> 	lapic_timer_period = 1000000000 / HZ;
>>
>> and that should disable LAPIC calibration if TSC deadline is absent.
> Given that they can only be omitted when a specific set of conditions
> is met, I think I'm going to make them optional but enabled by default.

Please do introduce the infrastructure to make them OnOffAuto, and for
now make Auto the same as On.  We have time to review that since microvm
is not versioned.

Thanks,

Paolo

> I'll also point to this in the documentation.
Sergio Lopez Pascual Sept. 26, 2019, 12:12 p.m. UTC | #35
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 26/09/19 12:16, Sergio Lopez wrote:
>>> If KVM is in use, the
>>> LAPIC timer frequency is known to be 1 GHz.
>>>
>>> arch/x86/kernel/kvm.c can just set
>>>
>>> 	lapic_timer_period = 1000000000 / HZ;
>>>
>>> and that should disable LAPIC calibration if TSC deadline is absent.
>> Given that they can only be omitted when a specific set of conditions
>> is met, I think I'm going to make them optional but enabled by default.
>
> Please do introduce the infrastructure to make them OnOffAuto, and for
> now make Auto the same as On.  We have time to review that since microvm
> is not versioned.

OK, sounds like a good idea to me.

Thanks,
Sergio.