Message ID | 1ee67238bd543959c3218612bff4acca06d15baa.1571905346.git.jag.raman@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | Initial support of multi-process qemu |
On Thu, Oct 24, 2019 at 05:09:29AM -0400, Jagannathan Raman wrote: > From: John G Johnson <john.g.johnson@oracle.com> > > Signed-off-by: John G Johnson <john.g.johnson@oracle.com> > Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com> > Signed-off-by: Jagannathan Raman <jag.raman@oracle.com> > --- > v2 -> v3: > - Updated with latest design of this project > > v3 -> v4: > - Updated document to RST format > Hi, A warning was reported against this patch because the index entry for the multi-process document is incorrect, as pointed out by the automated tests. "/tmp/qemu-test/src/docs/devel/index.rst:13:toctree contains reference to nonexisting document 'multi-process'". A corrected version of this patch is available. Should it be sent in the next series, or can the corrected version be attached here? Thank you! Elena, Jag and JJ. > docs/devel/index.rst | 1 + > docs/devel/qemu-multiprocess.rst | 1102 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 1103 insertions(+) > create mode 100644 docs/devel/qemu-multiprocess.rst > > diff --git a/docs/devel/index.rst b/docs/devel/index.rst > index 1ec61fc..edd3fe3 100644 > --- a/docs/devel/index.rst > +++ b/docs/devel/index.rst > @@ -22,3 +22,4 @@ Contents: > decodetree > secure-coding-practices > tcg > + multi-process > diff --git a/docs/devel/qemu-multiprocess.rst b/docs/devel/qemu-multiprocess.rst > new file mode 100644 > index 0000000..2c42c6e > --- /dev/null > +++ b/docs/devel/qemu-multiprocess.rst > @@ -0,0 +1,1102 @@ > +Disaggregating QEMU > +=================== > + > +QEMU is often used as the hypervisor for virtual machines running in the > +Oracle cloud. Since one of the advantages of cloud computing is the > +ability to run many VMs from different tenants in the same cloud > +infrastructure, a guest that compromised its hypervisor could > +potentially use the hypervisor's access privileges to access data it is > +not authorized for.
> + > +QEMU can be susceptible to security attack because it is a large, > +monolithic program that provides many features to the VMs it services. > +Many of these features can be configured out of QEMU, but even a reduced > +configuration QEMU has a large amount of code a guest can potentially > +attack in order to gain additional privileges. > + > +QEMU services > +------------- > + > +QEMU can be broadly described as providing three main services. One is a > +VM control point, where VMs can be created, migrated, re-configured, and > +destroyed. A second is to emulate the CPU instructions within the VM, > +often accelerated by HW virtualization features such as Intel's VT > +extensions. Finally, it provides IO services to the VM by emulating HW > +IO devices, such as disk and network devices. > + > +A disaggregated QEMU > +~~~~~~~~~~~~~~~~~~~~ > + > +A disaggregated QEMU involves separating QEMU services into separate > +host processes. Each of these processes can be given only the privileges > +it needs to provide its service, e.g., a disk service could be given > +access only to the disk images it provides, and not be allowed to > +access other files, or any network devices. An attacker who compromised > +this service would not be able to use this exploit to access files or > +devices beyond what the disk service was given access to. > + > +A QEMU control process would remain, but in disaggregated mode, it would > +be a control point that executes the processes needed to support the VM > +being created, but have no direct interfaces to the VM. During VM > +execution, it would still provide the user interface to hot-plug devices > +or live migrate the VM. > + > +A first step in creating a disaggregated QEMU is to separate IO services > +from the main QEMU program, which would continue to provide CPU > +emulation. i.e., the control process would also be the CPU emulation > +process. In a later phase, CPU emulation could be separated from the > +control process.
> + > +Disaggregating IO services > +-------------------------- > + > +Disaggregating IO services is a good place to begin disaggregating QEMU > +for a couple of reasons. One is that the sheer number of IO devices QEMU can > +emulate provides a large surface of interfaces which could potentially > +be exploited, and, indeed, have been a source of exploits in the past. > +Another is that the modular nature of QEMU device emulation code provides > +interface points where the QEMU functions that perform device emulation > +can be separated from the QEMU functions that manage the emulation of > +guest CPU instructions. > + > +QEMU device emulation > +~~~~~~~~~~~~~~~~~~~~~ > + > +QEMU uses an object-oriented SW architecture for device emulation code. > +Configured objects are all compiled into the QEMU binary, then objects > +are instantiated by name when used by the guest VM. For example, the > +code to emulate a device named "foo" is always present in QEMU, but its > +instantiation code is only run when the device is included in the target > +VM. (e.g., via the QEMU command line as *-device foo*) > + > +The object model is hierarchical, so device emulation code names its > +parent object (such as "pci-device" for a PCI device) and QEMU will > +instantiate a parent object before calling the device's instantiation > +code. > + > +Current separation models > +~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +In order to separate the device emulation code from the CPU emulation > +code, the device object code must run in a different process. There are > +a couple of existing QEMU features that can run emulation code > +separately from the main QEMU process. These are examined below. > + > +vhost user model > +^^^^^^^^^^^^^^^^ > + > +Virtio guest device drivers can be connected to vhost user applications > +in order to perform their IO operations.
This model uses special virtio > +device drivers in the guest and vhost user device objects in QEMU, but > +once the QEMU vhost user code has configured the vhost user application, > +mission-mode IO is performed by the application. The vhost user > +application is a daemon process that can be contacted via a known UNIX > +domain socket. > + > +vhost socket > +'''''''''''' > + > +As mentioned above, one of the tasks of the vhost device object within > +QEMU is to contact the vhost application and send it configuration > +information about this device instance. As part of the configuration > +process, the application can also be sent other file descriptors over > +the socket, which then can be used by the vhost user application in > +various ways, some of which are described below. > + > +vhost MMIO store acceleration > +''''''''''''''''''''''''''''' > + > +VMs are often run using HW virtualization features via the KVM kernel > +driver. This driver allows QEMU to accelerate the emulation of guest CPU > +instructions by running the guest in a virtual HW mode. When the guest > +executes instructions that cannot be executed by virtual HW mode, > +execution returns to the KVM driver so it can inform QEMU to emulate the > +instructions in SW. > + > +One of the events that can cause a return to QEMU is when a guest device > +driver accesses an IO location. QEMU then dispatches the memory > +operation to the corresponding QEMU device object. In the case of a > +vhost user device, the memory operation would need to be sent over a > +socket to the vhost application. This path is accelerated by the QEMU > +virtio code by setting up an eventfd file descriptor that the vhost > +application can directly receive MMIO store notifications from the KVM > +driver, instead of needing them to be sent to the QEMU process first. 
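A minimal sketch of the descriptor passing described in the vhost socket section above, written in Python rather than QEMU's C. It is not vhost-user code: the socketpair stands in for the vhost UNIX domain socket, a pipe stands in for the notification descriptor, and the function name and the `notify-fd` message are hypothetical. It assumes Python 3.9 or later for `socket.send_fds()` and `socket.recv_fds()`.

```python
import os
import socket


def pass_descriptor():
    """Send a descriptor over a UNIX socket and use the received copy."""
    # Stand-in for the UNIX domain socket between QEMU and the application.
    qemu_end, app_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Stand-in for a notification descriptor created on the QEMU side.
    notify_r, notify_w = os.pipe()
    try:
        # "QEMU" side: the descriptor travels as SCM_RIGHTS ancillary data.
        socket.send_fds(qemu_end, [b"notify-fd"], [notify_w])

        # "Application" side: receives the message plus a new descriptor
        # number that refers to the same open file.
        msg, fds, _flags, _addr = socket.recv_fds(app_end, 64, 1)
        received = fds[0]
        os.write(received, b"kick")
        os.close(received)

        # Data written through the received descriptor is visible on the
        # read end held by the sender.
        return msg, os.read(notify_r, 4)
    finally:
        for fd in (notify_r, notify_w):
            os.close(fd)
        qemu_end.close()
        app_end.close()
```

The received descriptor number usually differs from the sender's, but both refer to the same kernel object. This is what lets the application use a descriptor that QEMU configured.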
> + > +vhost interrupt acceleration > +'''''''''''''''''''''''''''' > + > +Another optimization used by the vhost application is the ability to > +directly inject interrupts into the VM via the KVM driver, again, > +bypassing the need to send the interrupt back to the QEMU process first. > +The QEMU virtio setup code configures the KVM driver with an eventfd > +that triggers the device interrupt in the guest when the eventfd is > +written. This irqfd file descriptor is then passed to the vhost user > +application program. > + > +vhost access to guest memory > +'''''''''''''''''''''''''''' > + > +The vhost application is also allowed to directly access guest memory, > +instead of needing to send the data as messages to QEMU. This is also > +done with file descriptors sent to the vhost user application by QEMU. > +These descriptors can be passed to ``mmap()`` by the vhost application > +to map the guest address space into the vhost application. > + > +IOMMUs introduce another level of complexity, since the address given to > +the guest virtio device to DMA to or from is not a guest physical > +address. This case is handled by having vhost code within QEMU register > +as a listener for IOMMU mapping changes. The vhost application maintains > +a cache of IOMMU translations: sending translation requests back to > +QEMU on cache misses, and in turn receiving flush requests from QEMU > +when mappings are purged. > + > +applicability to device separation > +'''''''''''''''''''''''''''''''''' > + > +Much of the vhost model can be re-used by separated device emulation. In > +particular, the ideas of using a socket between QEMU and the device > +emulation application, using a file descriptor to inject interrupts into > +the VM via KVM, and allowing the application to ``mmap()`` the guest > +should be re-used. > + > +There are, however, some notable differences between how a vhost > +application works and the needs of separated device emulation.
The most > +basic is that vhost uses custom virtio device drivers which always > +trigger IO with MMIO stores. A separated device emulation model must > +work with existing IO device models and guest device drivers. MMIO loads > +break vhost store acceleration since they are synchronous - guest > +progress cannot continue until the load has been emulated. By contrast, > +stores are asynchronous: the guest can continue after the store event > +has been sent to the vhost application. > + > +Another difference is that in the vhost user model, a single daemon can > +support multiple QEMU instances. This is contrary to the security regime > +desired, in which the emulation application should only be allowed to > +access the files or devices the VM it's running on behalf of can access. > + > +qemu-io model > +^^^^^^^^^^^^^ > + > +Qemu-io is a test harness used to test changes to the QEMU block backend > +object code. (e.g., the code that implements disk images for disk driver > +emulation) Qemu-io is not a device emulation application per se, but it > +does compile the QEMU block objects into a separate binary from the main > +QEMU one. This could be useful for disk device emulation, since its > +emulation applications will need to include the QEMU block objects. > + > +New separation model based on proxy objects > +------------------------------------------- > + > +A different model based on proxy objects in the QEMU program > +communicating with remote emulation programs could provide separation > +while minimizing the changes needed to the device emulation code. The > +rest of this section is a discussion of how a proxy object model would > +work. > + > +Remote emulation processes > +~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +The remote emulation process will run the QEMU object hierarchy without > +modification.
The device emulation objects will also be based on the > +QEMU code, because for anything but the simplest device, it would not be > +tractable to re-implement both the object model and the many device > +backends that QEMU has. > + > +The processes will communicate with the QEMU process over UNIX domain > +sockets. The processes can be executed either as standalone processes, > +or be executed by QEMU. In both cases, the host backends the emulation > +processes will provide are specified on its command line, as they would > +be for QEMU. For example: > + > +:: > + > + disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \ > + -blockdev driver=qcow2,node-name=drive0,file=file0 > + > +would indicate process *disk-proc* uses a qcow2 emulated disk named > +*file0* as its backend. > + > +Emulation processes may emulate more than one guest controller. A common > +configuration might be to put all controllers of the same device class > +(e.g., disk, network, etc.) in a single process, so that all backends of > +the same type can be managed by a single QMP monitor. > + > +communication with QEMU > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +Remote emulation processes will recognize a *-socket* argument that > +specifies the path of a UNIX domain socket used to communicate with the > +QEMU process. If no *-socket* argument is present, the process will use > +file descriptor 0 to communicate with QEMU. For example, > + > +:: > + > + disk-proc -socket /tmp/disk0-sock <backend list> > + > +will communicate with QEMU using the socket path */tmp/disk0-sock*. > + > +remote process QMP monitor > +^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Remote emulation processes can be monitored via QMP, similar to QEMU > +itself. The QMP monitor socket is specified the same as for a QEMU > +process: > + > +:: > + > + disk-proc -qmp unix:/tmp/disk-mon,server > + > +can be monitored over the UNIX socket path */tmp/disk-mon*.
> + > +QEMU command line > +~~~~~~~~~~~~~~~~~ > + > +The QEMU command line options will need to be modified to indicate which > +items are emulated by a separate program, and which remain emulated by > +QEMU itself. > + > +identifying remote emulation processes > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Remote emulation processes will be identified to QEMU using a *-remote* > +command line option. This option can either specify a command that QEMU > +will execute, or can specify a UNIX domain socket that QEMU can use to > +connect to an existing process. Both forms require an "id" option that > +identifies the process to later *-device* options. The process version > +is: > + > +:: > + > + -remote id=disk-proc,command="disk-proc <backend list>" > + > +And the socket version is: > + > +:: > + > + -remote id=disk-proc,socket="/tmp/disk0-sock" > + > +In the latter case, the remote process must be given the same socket on > +its command line when it is executed: > + > +:: > + > + disk-proc -socket /tmp/disk0-sock <backend list> > + > +identifying devices emulated remotely > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Devices that are to be emulated in a separate process will identify > +the remote process with a "remote" option on their *-device* command > +line specification. e.g., an LSI SCSI controller and disk can be > +specified as: > + > +:: > + > + -device lsi53c895a,id=scsi0 > + -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0 > + > +If these devices are emulated by remote process "disk-proc," as > +described in the previous section, the QEMU command line would be: > + > +:: > + > + -device lsi53c895a,id=scsi0,remote=disk-proc > + -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0,remote=disk-proc > + > +Some devices are implicitly created by the machine object. e.g., the q35 > +machine object will create its PCI bus, and attach an ich9-ahci IDE > +controller to it. In this case, options will need to be added to the > +*-machine* command line.
e.g., > + > +:: > + > + -machine pc-q35,ide-remote=disk-proc > + > +will use the remote process with an "id" of "disk-proc" to emulate the > +IDE controller and its disks. > + > +The disks themselves still need to be specified with the "remote" option, > +as in the example above. e.g., > + > +:: > + > + -device ide-hd,drive=drive0,bus=ide.0,unit=0,remote=disk-proc > + > +QEMU management of remote processes > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Each *-remote* instance on the QEMU command line will create a remote > +process proxy instance in QEMU. They will be held on a *QList* that can > +be searched for by its "id" property. The remote process proxy will also > +establish a communication channel between QEMU and the remote process. > +This can be done in one of two ways: direct execution of the > +process by QEMU with ``fork()`` and ``exec()`` system calls, or by > +connecting to an existing process. > + > +direct execution > +^^^^^^^^^^^^^^^^ > + > +When the remote process is directly executed, the remote process proxy > +will set up a communication channel between itself and the emulation > +process. This channel will be created using ``socketpair()`` and the > +remote process side of the pair will be given to the process as file > +descriptor 0. > + > +connecting to an existing process > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Some environments wish to deny QEMU the ability to execute ``fork()`` > +and ``exec()``. In these cases, emulation processes will be started before > +QEMU, and a UNIX domain socket will be given to each emulation process > +to communicate with QEMU over. After communication is established, the > +socket will be unlinked from the file system space by the QEMU process.
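The direct execution path above can be illustrated with a short Python sketch (the real implementation would be C inside QEMU). Everything named here is hypothetical: `spawn_remote()`, `demo()`, and the tiny echo program that stands in for an emulation process. Only the mechanism comes from the text: ``socketpair()``, ``fork()``, ``exec()``, and the child's end of the pair installed as file descriptor 0.

```python
import os
import socket
import sys

# Hypothetical stand-in for an emulation process: it talks to its parent
# over file descriptor 0 and acknowledges one message.
CHILD_SOURCE = (
    "import socket\n"
    "chan = socket.socket(fileno=0)\n"
    "msg = chan.recv(64)\n"
    "chan.sendall(b'ack:' + msg)\n"
)


def spawn_remote(argv):
    """Fork and exec argv with one end of a socketpair as its fd 0."""
    parent_end, child_end = socket.socketpair(socket.AF_UNIX,
                                              socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: install our end of the pair as fd 0, then exec.
        try:
            parent_end.close()
            os.dup2(child_end.fileno(), 0)  # fd 0 survives the exec
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # reached only if dup2 or exec failed
    # Parent: keep only our own end of the pair.
    child_end.close()
    return pid, parent_end


def demo():
    pid, chan = spawn_remote([sys.executable, "-c", CHILD_SOURCE])
    chan.sendall(b"hello")
    reply = b""
    while chunk := chan.recv(64):  # read until the child closes its end
        reply += chunk
    chan.close()
    os.waitpid(pid, 0)
    return reply
```

Python creates the socketpair descriptors as non-inheritable, so only the duplicate installed at descriptor 0 reaches the child program. This matches the goal of handing the emulation process only what it needs.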
> + > +communication with emulation process > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +primary socket > +'''''''''''''' > + > +Whether the process was executed by QEMU or externally, there will be a > +primary socket for communication between QEMU and the remote process. > +This channel will handle configuration commands from QEMU to the > +process, either from the QEMU command line, or from QMP commands that > +affect the devices being emulated by the process. This channel will only > +allow one message to be pending at a time; if additional messages > +arrive, they must wait for previous ones to be acknowledged from the > +remote side. > + > +secondary sockets > +''''''''''''''''' > + > +The primary socket can pass the file descriptors of secondary sockets > +for operations that occur in parallel with commands on the primary > +channel. These include MMIO operations generated by the guest, interrupt > +notifications generated by the devices being emulated, or *vmstate* for > +live migration. These secondary sockets will be created at the behest of > +the device proxies that require them. A disk device proxy wouldn't need > +any secondary sockets, but a disk controller device proxy may need both > +an MMIO socket and an interrupt socket. > + > +emulation process attached via QMP command > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +There will be a new "attach-process" QMP command to facilitate device > +hot-plug. This command's arguments will be the same as the *-remote* > +command line when it's used to attach to a remote process. i.e., it will > +need an "id" argument so that hot-plugged devices can later find it, and > +a "socket" argument to identify the UNIX domain socket that will be used > +to communicate with QEMU. > + > +QEMU device proxy objects > +~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +QEMU has an object model based on sub-classes inherited from the > +"object" super-class. 
The sub-classes that are of interest here are the > +"device" and "bus" sub-classes whose child sub-classes make up the > +device tree of a QEMU emulated system. > + > +The proxy object model will use device proxy objects to replace the > +device emulation code within the QEMU process. These objects will live > +in the same place in the object and bus hierarchies as the objects they > +replace. i.e., the proxy object for an LSI SCSI controller will be a > +sub-class of the "pci-device" class, and will have the same PCI bus > +parent and the same SCSI bus child objects as the LSI controller object > +it replaces. > + > +After the QEMU command line has been parsed, the remote devices will be > +instantiated in the same manner as local devices are. (i.e., > +``qdev_device_add()``). In order to distinguish them from regular > +*-device* device objects, their class name will be the name of the class > +it replaces, with "-proxy" appended. e.g., the "lsi53c895a" proxy class > +will be "lsi53c895a-proxy." > + > +device JSON description > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +The remote process needs a JSON representation of the command line > +options used to create the object. This JSON representation is used to > +create the corresponding object in the emulation process. e.g., for an > +LSI SCSI controller invoked as: > + > +:: > + > + -device lsi53c895a,id=scsi0,remote=lsi-scsi > + > +the proxy object would create a > + > +:: > + > + { "driver" : "lsi53c895a", "id" : "scsi0" } > + > +JSON description. The "driver" option is assigned to the device name > +when the command line is parsed, so the "-proxy" appended by the command > +line parsing code is removed. The "remote" option isn't needed in the > +JSON description since it only applies to the proxy object in the QEMU > +process. > + > +device object whitelist > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +Some device objects may not need a proxy. These are devices with no > +direct guest interfaces. 
(e.g., no MMIO, PIO, or interrupts). There will > +be a whitelist of such devices, and any devices on this list will not be > +instantiated in QEMU. Their JSON representation will still be sent to > +the remote process, so the object can be created there. > + > +object initialization > +^^^^^^^^^^^^^^^^^^^^^ > + > +QEMU object initialization occurs in two phases. The first > +initialization happens once per object class. (i.e., there can be many > +SCSI disks in an emulated system, but the "scsi-hd" class has its > +``class_init()`` function called only once) The second phase happens > +when each object's ``instance_init()`` function is called to initialize > +each instance of the object. > + > +All device objects are sub-classes of the "device" class, so they also > +have a ``realize()`` function that is called after ``instance_init()`` > +is called and after the object's static properties have been > +initialized. Many device objects don't even provide an instance\_init() > +function, and do all their per-instance work in ``realize()``. > + > +class\_init > +''''''''''' > + > +The ``class_init()`` method of a proxy object will, in general, behave > +similarly to the object it replaces, including setting any static > +properties and methods needed by the proxy. > + > +instance\_init / realize > +'''''''''''''''''''''''' > + > +The ``instance_init()`` and ``realize()`` functions would only need to > +perform tasks related to being a proxy, such as registering its own > +MMIO handlers, or creating a child bus that other proxy devices can be > +attached to later. > + > +Other tasks are device-specific. For example, PCI device objects > +will initialize the PCI config space in order to make a valid PCI device > +tree within the QEMU process. > + > +address space registration > +^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Most devices are driven by guest device driver accesses to IO addresses > +or ports.
The QEMU device emulation code uses QEMU's memory region > +function calls (such as ``memory_region_init_io()``) to add callback > +functions that QEMU will invoke when the guest accesses the device's > +areas of the IO address space. When a guest driver does access the > +device, the VM will exit HW virtualization mode and return to QEMU, > +which will then look up and execute the corresponding callback function. > + > +A proxy object would need to mirror the memory region calls the actual > +device emulator would perform in its initialization code, but with its > +own callbacks. When invoked by QEMU as a result of a guest IO operation, > +they will forward the operation to the device emulation process. > + > +PCI config space > +^^^^^^^^^^^^^^^^ > + > +PCI devices also have a configuration space that can be accessed by the > +guest driver. Guest accesses to this space are not handled by the device > +emulation object, but by its PCI parent object. Much of this space is > +read-only, but certain registers (especially BAR and MSI-related ones) > +need to be propagated to the emulation process. > + > +PCI parent proxy > +'''''''''''''''' > + > +One way to propagate guest PCI config accesses is to create a > +"pci-device-proxy" class that can serve as the parent of a PCI device > +proxy object. This class's parent would be "pci-device" and it would > +override the PCI parent's ``config_read()`` and ``config_write()`` > +methods with ones that forward these operations to the emulation > +program. > + > +interrupt receipt > +^^^^^^^^^^^^^^^^^ > + > +A proxy for a device that generates interrupts will need to create a > +socket to receive interrupt indications from the emulation process. An > +incoming interrupt indication would then be sent up to its bus parent to > +be injected into the guest. For example, a PCI device object may use > +``pci_set_irq()``.
> + > +live migration > +^^^^^^^^^^^^^^ > + > +The proxy will register to save and restore any *vmstate* it needs over > +a live migration event. The device proxy does not need to manage the > +remote device's *vmstate*; that will be handled by the remote process > +proxy (see below). > + > +QEMU remote device operation > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Generic device operations, such as DMA, will be performed by the remote > +process proxy by sending messages to the remote process. > + > +DMA operations > +^^^^^^^^^^^^^^ > + > +DMA operations would be handled much like vhost applications do. One of > +the initial messages sent to the emulation process is a guest memory > +table. Each entry in this table consists of a file descriptor and size > +that the emulation process can ``mmap()`` to directly access guest > +memory, similar to ``vhost_user_set_mem_table()``. Note guest memory > +must be backed by file descriptors, such as when QEMU is given the > +*-mem-path* command line option. > + > +IOMMU operations > +^^^^^^^^^^^^^^^^ > + > +When the emulated system includes an IOMMU, the remote process proxy in > +QEMU will need to create a socket for IOMMU requests from the emulation > +process. It will handle those requests with an > +``address_space_get_iotlb_entry()`` call. In order to handle IOMMU > +unmaps, the remote process proxy will also register as a listener on the > +device's DMA address space. When an IOMMU memory region is created > +within the DMA address space, an IOMMU notifier for unmaps will be added > +to the memory region that will forward unmaps to the emulation process > +over the IOMMU socket. > + > +device hot-plug via QMP > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +A QMP "device\_add" command can add a device emulated by a remote > +process. It needs to add a "remote" option to the command, just as the > +*-device* command line option does.
The remote process may either be one > +started at QEMU startup, or be one added by the "attach-process" QMP > +command described above. In either case, the remote process proxy will > +forward the new device's JSON description to the corresponding emulation > +process. > + > +live migration > +^^^^^^^^^^^^^^ > + > +The remote process proxy will also register for live migration > +notifications with ``vmstate_register()``. When called to save state, > +the proxy will send the remote process a secondary socket file > +descriptor to save the remote process's device *vmstate* over. The > +incoming byte stream length and data will be saved as the proxy's > +*vmstate*. When the proxy is resumed on its new host, this *vmstate* > +will be extracted, and a secondary socket file descriptor will be sent > +to the new remote process through which it receives the *vmstate* in > +order to restore the devices there. > + > +device emulation in remote process > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +The parts of QEMU that the emulation program will need include the > +object model; the memory emulation objects; the device emulation objects > +of the targeted device, and any dependent devices; and, the device's > +backends. It will also need code to set up the machine environment, > +handle requests from the QEMU process, and route machine-level requests > +(such as interrupts or IOMMU mappings) back to the QEMU process. > + > +initialization > +'''''''''''''' > + > +The process initialization sequence will follow the same sequence > +followed by QEMU. It will first initialize the backend objects, then > +device emulation objects. The JSON descriptions sent by the QEMU process > +will drive which objects need to be created. > + > +- address spaces > + > +Before the device objects are created, the initial address spaces and > +memory regions must be configured with ``memory_map_init()``.
This > +creates a RAM memory region object (*system\_memory*) and an IO memory > +region object (*system\_io*). > + > +- RAM > + > +RAM memory region creation will follow how ``pc_memory_init()`` creates > +them, but must use ``memory_region_init_ram_from_fd()`` instead of > +``memory_region_allocate_system_memory()``. The file descriptors needed > +will be supplied by the guest memory table from above. Those RAM regions > +would then be added to the *system\_memory* memory region with > +``memory_region_add_subregion()``. > + > +- PCI > + > +IO initialization will be driven by the JSON descriptions sent from the > +QEMU process. For a PCI device, a PCI bus will need to be created with > +``pci_root_bus_new()``, and a PCI memory region will need to be created > +and added to the *system\_memory* memory region with > +``memory_region_add_subregion_overlap()``. The overlap version is > +required for architectures where PCI memory overlaps with RAM memory. > + > +MMIO handling > +''''''''''''' > + > +The device emulation objects will use ``memory_region_init_io()`` to > +install their MMIO handlers, and ``pci_register_bar()`` to associate > +those handlers with a PCI BAR, as they do within QEMU currently. > + > +In order to use ``address_space_rw()`` in the emulation process to > +handle MMIO requests from QEMU, the PCI physical addresses must be the > +same in the QEMU process and the device emulation process. In order to > +accomplish that, guest BAR programming must also be forwarded from QEMU > +to the emulation process. > + > +interrupt injection > +''''''''''''''''''' > + > +When device emulation wants to inject an interrupt into the VM, the > +request climbs the device's bus object hierarchy until the point where a > +bus object knows how to signal the interrupt to the guest. The details > +depend on the type of interrupt being raised. 
> + > +- PCI pin interrupts > + > +On x86 systems, there is an emulated IOAPIC object attached to the root > +PCI bus object, and the root PCI object forwards interrupt requests to > +it. The IOAPIC object, in turn, calls the KVM driver to inject the > +corresponding interrupt into the VM. The simplest way to handle this in > +an emulation process would be to set up the root PCI bus driver (via > +``pci_bus_irqs()``) to send an interrupt request back to the QEMU > +process, and have the device proxy object reflect it up the PCI tree > +there. > + > +- PCI MSI/X interrupts > + > +PCI MSI/X interrupts are implemented in HW as DMA writes to a > +CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives > +these DMA writes, then calls into the KVM driver to inject the interrupt > +into the VM. A simple emulation process implementation would be to send > +the MSI DMA address from QEMU as a message at initialization, then > +install an address space handler at that address which forwards the MSI > +message back to QEMU. > + > +DMA operations > +'''''''''''''' > + > +When an emulation object wants to DMA into or out of guest memory, it > +first must use dma\_memory\_map() to convert the DMA address to a local > +virtual address. The emulation process memory region objects set up above > +will be used to translate the DMA address to a local virtual address the > +device emulation code can access. > + > +IOMMU > +''''' > + > +When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory > +regions to translate the DMA address to a guest physical address before > +that physical address can be translated to a local virtual address. The > +emulation process will need similar functionality. > + > +- IOTLB cache > + > +The emulation process will maintain a cache of recent IOMMU translations > +(the IOTLB).
When the translate() callback of an IOMMU memory region is > +invoked, the IOTLB cache will be searched for an entry that will map the > +DMA address to a guest PA. On a cache miss, a message will be sent back > +to QEMU requesting the corresponding translation entry, which will both be > +used to return a guest address and be added to the cache. > + > +- IOTLB purge > + > +The IOMMU emulation will also need to act on unmap requests from QEMU. > +These happen when the guest IOMMU driver purges an entry from the > +guest's translation table. > + > +live migration > +'''''''''''''' > + > +When a remote process receives a live migration indication from QEMU, it > +will set up a channel using the received file descriptor with > +``qio_channel_socket_new_fd()``. This channel will be used to create a > +*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send > +the process's device state back to QEMU. This method will be reversed on > +restore - the channel will be passed to ``qemu_loadvm_state()`` to > +restore the device state. > + > +Accelerating device emulation > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +The messages that are required to be sent between QEMU and the emulation > +process can add considerable latency to IO operations. The optimizations > +described below attempt to ameliorate this effect by allowing the > +emulation process to communicate directly with the kernel KVM driver. > +The KVM file descriptors created would be passed to the emulation process > +via initialization messages, much as is done for the guest memory table. > + > +MMIO acceleration > +^^^^^^^^^^^^^^^^^ > + > +Vhost user applications can receive guest virtio driver stores directly > +from KVM. The issue with the eventfd mechanism used by vhost user is > +that it does not pass any data with the event indication, so it cannot > +handle guest loads or guest stores that carry store data. This concept > +could, however, be expanded to cover more cases.
> + > +The expanded idea would require a new type of KVM device: > +*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master > +descriptor that QEMU can use for configuration, and a slave descriptor > +that the emulation process can use to receive MMIO notifications. QEMU > +would create both descriptors using the KVM driver, and pass the slave > +descriptor to the emulation process via an initialization message. > + > +data structures > +''''''''''''''' > + > +- guest physical range > + > +The guest physical range structure describes the address range that a > +device will respond to. It includes the base and length of the range, as > +well as which bus the range resides on (e.g., on an x86machine, it can > +specify whether the range refers to memory or IO addresses). > + > +A device can have multiple physical address ranges it responds to (e.g., > +a PCI device can have multiple BARs), so the structure will also include > +an enumerated identifier to specify which of the device's ranges is > +being referred to. > + > ++--------+----------------------------+ > +| Name | Description | > ++========+============================+ > +| addr | range base address | > ++--------+----------------------------+ > +| len | range length | > ++--------+----------------------------+ > +| bus | addr type (memory or IO) | > ++--------+----------------------------+ > +| id | range ID (e.g., PCI BAR) | > ++--------+----------------------------+ > + > +- MMIO request structure > + > +This structure describes an MMIO operation. It includes which guest > +physical range the MMIO was within, the offset within that range, the > +MMIO type (e.g., load or store), and its length and data. It also > +includes a sequence number that can be used to reply to the MMIO, and > +the CPU that issued the MMIO. 
> + > ++----------+------------------------+ > +| Name | Description | > ++==========+========================+ > +| rid | range MMIO is within | > ++----------+------------------------+ > +| offset | offset withing *rid* | > ++----------+------------------------+ > +| type | e.g., load or store | > ++----------+------------------------+ > +| len | MMIO length | > ++----------+------------------------+ > +| data | store data | > ++----------+------------------------+ > +| seq | sequence ID | > ++----------+------------------------+ > + > +- MMIO request queues > + > +MMIO request queues are FIFO arrays of MMIO request structures. There > +are two queues: pending queue is for MMIOs that haven't been read by the > +emulation program, and the sent queue is for MMIOs that haven't been > +acknowledged. The main use of the second queue is to validate MMIO > +replies from the emulation program. > + > +- scoreboard > + > +Each CPU in the VM is emulated in QEMU by a separate thread, so multiple > +MMIOs may be waiting to be consumed by an emulation program and multiple > +threads may be waiting for MMIO replies. The scoreboard would contain a > +wait queue and sequence number for the per-CPU threads, allowing them to > +be individually woken when the MMIO reply is received from the emulation > +program. It also tracks the number of posted MMIO stores to the device > +that haven't been replied to, in order to satisfy the PCI constraint > +that a load to a device will not complete until all previous stores to > +that device have been completed. > + > +- device shadow memory > + > +Some MMIO loads do not have device side-effects. These MMIOs can be > +completed without sending a MMIO request to the emulation program if the > +emulation program shares a shadow image of the device's memory image > +with the KVM driver. > + > +The emulation program will ask the KVM driver to allocate memory for the > +shadow image, and will then use ``mmap()`` to directly access it. 
The > +emulation program can control KVM access to the shadow image by sending > +KVM an access map telling it which areas of the image have no > +side-effects (and can be completed immediately), and which require an > +MMIO request to the emulation program. The access map can also inform > +the KVM driver which size accesses are allowed to the image. > + > +master descriptor > +''''''''''''''''' > + > +The master descriptor is used by QEMU to configure the new KVM device. > +The descriptor would be returned by the KVM driver when QEMU issues a > +*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type. > + > +KVM\_DEV\_TYPE\_USER device ops > + > + > +The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a > +``kvm_register_device_ops()`` call when the KVM system is initialized by > +``kvm_init()``. These device ops are called by the KVM driver when QEMU > +executes certain ``ioctl()`` operations on its KVM file descriptor. They > +include: > + > +- create > + > +This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE* > +``ioctl()`` on its per-VM file descriptor. It will allocate and > +initialize a KVM user device specific data structure, and assign the > +*kvm\_device* private field to it. > + > +- ioctl > + > +This routine is invoked when QEMU issues an ``ioctl()`` on the master > +descriptor. The ``ioctl()`` commands supported are defined by the KVM > +device type. *KVM\_DEV\_TYPE\_USER* devices will need several commands: > + > +*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will > +be passed to the device emulation program. Only one slave can be created > +by each master descriptor. The file operations performed by this > +descriptor are described below. > + > +The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical > +address range that the slave descriptor will receive MMIO notifications > +for. The range is specified by a guest physical range structure > +argument.
For buses that assign addresses to devices dynamically, this > +command can be executed while the guest is running, such as the case > +when a guest changes a device's PCI BAR registers. > + > +*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to > +register *kvm\_io\_device\_ops* callbacks to be invoked when the guest > +performs a MMIO operation within the range. When a range is changed, > +``kvm_io_bus_unregister_dev()`` is used to remove the previous > +instantiation. > + > +*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies > +how long KVM will wait for the emulation process to respond to a MMIO > +indication. > + > +- destroy > + > +This routine is called when the VM instance is destroyed. It will need > +to destroy the slave descriptor; and free any memory allocated by the > +driver, as well as the *kvm\_device* structure itself. > + > +slave descriptor > +'''''''''''''''' > + > +The slave descriptor will have its own file operations vector, which > +responds to system calls on the descriptor performed by the device > +emulation program. > + > +- read > + > +A read returns any pending MMIO requests from the KVM driver as MMIO > +request structures. Multiple structures can be returned if there are > +multiple MMIO operations pending. The MMIO requests are moved from the > +pending queue to the sent queue, and if there are threads waiting for > +space in the pending to add new MMIO operations, they will be woken > +here. > + > +- write > + > +A write also consists of a set of MMIO requests. They are compared to > +the MMIO requests in the sent queue. Matches are removed from the sent > +queue, and any threads waiting for the reply are woken. If a store is > +removed, then the number of posted stores in the per-CPU scoreboard is > +decremented. When the number is zero, and a non side-effect load was > +waiting for posted stores to complete, the load is continued. 
> + > +- ioctl > + > +There are several ioctl()s that can be performed on the slave > +descriptor. > + > +A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to > +allocate memory for the shadow image. This memory can later be > +``mmap()``\ ed by the emulation process to share the emulation's view of > +device memory with the KVM driver. > + > +A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the > +shadow image. It will send the KVM driver a shadow control map, which > +specifies which areas of the image can complete guest loads without > +sending the load request to the emulation program. It will also specify > +the size of load operations that are allowed. > + > +- poll > + > +An emulation program will use the ``poll()`` call with a *POLLIN* flag > +to determine if there are MMIO requests waiting to be read. It will > +return if the pending MMIO request queue is not empty. > + > +- mmap > + > +This call allows the emulation program to directly access the shadow > +image allocated by the KVM driver. As device emulation updates device > +memory, changes with no side-effects will be reflected in the shadow, > +and the KVM driver can satisfy guest loads from the shadow image without > +needing to wait for the emulation program. > + > +kvm\_io\_device ops > +''''''''''''''''''' > + > +Each KVM per-CPU thread can handle MMIO operation on behalf of the guest > +VM. KVM will use the MMIO's guest physical address to search for a > +matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM > +driver instead of exiting back to QEMU. If a match is found, the > +corresponding callback will be invoked. > + > +- read > + > +This callback is invoked when the guest performs a load to the device. > +Loads with side-effects must be handled synchronously, with the KVM > +driver putting the QEMU thread to sleep waiting for the emulation > +process reply before re-starting the guest. 
Loads that do not have > +side-effects may be optimized by satisfying them from the shadow image, > +if there are no outstanding stores to the device by this CPU. PCI memory > +ordering demands that a load cannot complete before all older stores to > +the same device have been completed. > + > +- write > + > +Stores can be handled asynchronously unless the pending MMIO request > +queue is full. In this case, the QEMU thread must sleep waiting for > +space in the queue. Stores will increment the number of posted stores in > +the per-CPU scoreboard, in order to implement the PCI ordering > +constraint above. > + > +interrupt acceleration > +^^^^^^^^^^^^^^^^^^^^^^ > + > +This performance optimization would work much like a vhost user > +application does, where the QEMU process sets up *eventfds* that cause > +the device's corresponding interrupt to be triggered by the KVM driver. > +These irq file descriptors are sent to the emulation process at > +initialization, and are used when the emulation code raises a device > +interrupt. > + > +intx acceleration > +''''''''''''''''' > + > +Traditional PCI pin interrupts are level based, so, in addition to an > +irq file descriptor, a re-sampling file descriptor needs to be sent to > +the emulation program. This second file descriptor allows multiple > +devices sharing an irq to be notified when the interrupt has been > +acknowledged by the guest, so they can re-trigger the interrupt if their > +device has not de-asserted its interrupt. > + > +intx irq descriptor > + > + > +The irq descriptors are created by the proxy object > +``using event_notifier_init()`` to create the irq and re-sampling > +*eventds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt. > +The interrupt route can be found with > +``pci_device_route_intx_to_irq()``. > + > +intx routing changes > + > + > +Intx routing can be changed when the guest programs the APIC the device > +pin is connected to. 
The proxy object in QEMU will use > +``pci_device_set_intx_routing_notifier()`` to be informed of any guest > +changes to the route. This handler will broadly follow the VFIO > +interrupt logic to change the route: de-assigning the existing irq > +descriptor from its route, then assigning it the new route. (see > +``vfio_intx_update()``) > + > +MSI/X acceleration > +'''''''''''''''''' > + > +MSI/X interrupts are sent as DMA transactions to the host. The interrupt > +data contains a vector that is programed by the guest, A device may have > +multiple MSI interrupts associated with it, so multiple irq descriptors > +may need to be sent to the emulation program. > + > +MSI/X irq descriptor > + > + > +This case will also follow the VFIO example. For each MSI/X interrupt, > +an *eventfd* is created, a virtual interrupt is allocated by > +``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to > +the eventfd with ``kvm_irqchip_add_irqfd_notifier()``. > + > +MSI/X config space changes > + > + > +The guest may dynamically update several MSI-related tables in the > +device's PCI config space. These include per-MSI interrupt enables and > +vector data. Additionally, MSIX tables exist in device memory space, not > +config space. Much like the BAR case above, the proxy object must look > +at guest config space programming to keep the MSI interrupt state > +consistent between QEMU and the emulation program. > + > +-------------- > + > +Disaggregated CPU emulation > +--------------------------- > + > +After IO services have been disaggregated, a second phase would be to > +separate a process to handle CPU instruction emulation from the main > +QEMU control function. There are no object separation points for this > +code, so the first task would be to create one. 
> + > +Host access controls > +-------------------- > + > +Separating QEMU relies on the host OS's access restriction mechanisms to > +enforce that the differing processes can only access the objects they > +are entitled to. There are a couple types of mechanisms usually provided > +by general purpose OSs. > + > +Discretionary access control > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Discretionary access control allows each user to control who can access > +their files. In Linux, this type of control is usually too coarse for > +QEMU separation, since it only provides three separate access controls: > +one for the same user ID, the second for users IDs with the same group > +ID, and the third for all other user IDs. Each device instance would > +need a separate user ID to provide access control, which is likely to be > +unwieldy for dynamically created VMs. > + > +Mandatory access control > +~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Mandatory access control allows the OS to add an additional set of > +controls on top of discretionary access for the OS to control. It also > +adds other attributes to processes and files such as types, roles, and > +categories, and can establish rules for how processes and files can > +interact. > + > +Type enforcement > +^^^^^^^^^^^^^^^^ > + > +Type enforcement assigns a *type* attribute to processes and files, and > +allows rules to be written on what operations a process with a given > +type can perform on a file with a given type. QEMU separation could take > +advantage of type enforcement by running the emulation processes with > +different types, both from the main QEMU process, and from the emulation > +processes of different classes of devices. > + > +For example, guest disk images and disk emulation processes could have > +types separate from the main QEMU process and non-disk emulation > +processes, and the type rules could prevent processes other than disk > +emulation ones from accessing guest disk images. 
Similarly, network > +emulation processes can have a type separate from the main QEMU process > +and non-network emulation process, and only that type can access the > +host tun/tap device used to provide guest networking. > + > +Category enforcement > +^^^^^^^^^^^^^^^^^^^^ > + > +Category enforcement assigns a set of numbers within a given range to > +the process or file. The process is granted access to the file if the > +process's set is a superset of the file's set. This enforcement can be > +used to separate multiple instances of devices in the same class. > + > +For example, if there are multiple disk devices provides to a guest, > +each device emulation process could be provisioned with a separate > +category. The different device emulation processes would not be able to > +access each other's backing disk images. > + > +Alternatively, categories could be used in lieu of the type enforcement > +scheme described above. In this scenario, different categories would be > +used to prevent device emulation processes in different classes from > +accessing resources assigned to other classes. > -- > 1.8.3.1 >
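A side note on the category enforcement rule quoted above (access is granted when the process's category set is a superset of the file's set): the check can be sketched with a bitmask. This is only an illustration of the set relation, not of how any particular MAC implementation stores categories. The `category_set` type, the 64-category limit, and both function names are assumptions made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation: one bit per category, at most 64 categories. */
typedef uint64_t category_set;

/* Build a set holding the single category c (caller ensures c < 64). */
static category_set category_bit(unsigned int c)
{
    return (category_set)1 << c;
}

/*
 * Access is granted when the process's set is a superset of the file's
 * set, i.e. the file has no category that the process lacks.
 */
static bool category_access_ok(category_set process, category_set file)
{
    return (file & ~process) == 0;
}
```

With one category per disk emulation process, a process holding only category 1 would fail the check against a backing image labelled with category 2, which is the isolation between device instances that the text describes.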
On Thu, Oct 24, 2019 at 05:09:29AM -0400, Jagannathan Raman wrote: > diff --git a/docs/devel/qemu-multiprocess.rst b/docs/devel/qemu-multiprocess.rst > new file mode 100644 > index 0000000..2c42c6e > --- /dev/null > +++ b/docs/devel/qemu-multiprocess.rst > @@ -0,0 +1,1102 @@ > +Disaggregating QEMU > +=================== > + > +QEMU is often used as the hypervisor for virtual machines running in the > +Oracle cloud. Since one of the advantages of cloud computing is the > +ability to run many VMs from different tenants in the same cloud > +infrastructure, a guest that compromised its hypervisor could > +potentially use the hypervisor's access privileges to access data it is > +not authorized for. > + > +QEMU can be susceptible to security attack because it is a large, > +monolithic program that provides many features to the VMs it services. > +Many of these feature can be configured out of QEMU, but even a reduced > +configuration QEMU has a large amount of code a guest can potentially > +attack in order to gain additional privileges. The "additional privileges" are only host userspace code execution (i.e. syscalls) within an unprivileged process that is sandboxed using seccomp and SELinux on a properly configured system. If QEMU has access to resources that do not belong to the guest then you have not configured QEMU correctly (libvirt handles a lot of this setup for you). I think it's more accurate to describe the motivation for multi-process QEMU in terms of the principle of least privilege: each component in the system should only have access to the resources that it needs to perform its job. That way people don't get the impression that QEMU is a trusted component with access to resources that must be kept from the guest. > +QEMU services > +------------- > + > +QEMU can be broadly described as providing three main services. One is a > +VM control point, where VMs can be created, migrated, re-configured, and > +destroyed. 
A second is to emulate the CPU instructions within the VM, > +often accelerated by HW virtualization features such as Intel's VT > +extensions. Finally, it provides IO services to the VM by emulating HW > +IO devices, such as disk and network devices. > + > +A disaggregated QEMU > +~~~~~~~~~~~~~~~~~~~~ > + > +A disaggregated QEMU involves separating QEMU services into separate > +host processes. Each of these processes can be given only the privileges > +it needs to provide its service, e.g., a disk service could be given > +access only to the disk images it provides, and not be allowed to > +access other files, or any network devices. An attacker who compromised > +this service would not be able to use this exploit to access files or > +devices beyond what the disk service was given access to. > + > +A QEMU control process would remain, but in disaggregated mode, it would > +be a control point that executes the processes needed to support the VM > +being created, but have no direct interfaces to the VM. During VM > +execution, it would still provide the user interface to hot-plug devices > +or live migrate the VM.

"it would be a control point that executes the processes needed to support the VM being created"

libvirt does the sandboxing setup. I think the responsibility of executing and sandboxing device processes would also be left to libvirt, not to QEMU. Perhaps it's best to leave this sentence out and enable both approaches (1. QEMU executes device processes, 2. management tool executes device processes).

> +A first step in creating a disaggregated QEMU is to separate IO services > +from the main QEMU program, which would continue to provide CPU > +emulation. i.e., the control process would also be the CPU emulation > +process. In a later phase, CPU emulation could be separated from the > +control process.
> + > +Disaggregating IO services > +-------------------------- > + > +Disaggregating IO services is a good place to begin QEMU disaggregating > +for a couple of reasons. One is the sheer number of IO devices QEMU can > +emulate provides a large surface of interfaces which could potentially > +be exploited, and, indeed, have been a source of exploits in the past. > +Another is the modular nature of QEMU device emulation code provides > +interface points where the QEMU functions that perform device emulation > +can be separated from the QEMU functions that manage the emulation of > +guest CPU instructions. > + > +QEMU device emulation > +~~~~~~~~~~~~~~~~~~~~~ > + > +QEMU uses a object oriented SW architecture for device emulation code. > +Configured objects are all compiled into the QEMU binary, then objects > +are instantiated by name when used by the guest VM. For example, the > +code to emulate a device named "foo" is always present in QEMU, but its > +instantiation code is only run when the device is included in the target > +VM. (e.g., via the QEMU command line as *-device foo*) > + > +The object model is hierarchical, so device emulation code names its > +parent object (such as "pci-device" for a PCI device) and QEMU will > +instantiate a parent object before calling the device's instantiation > +code. > + > +Current separation models > +~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +In order to separate the device emulation code from the CPU emulation > +code, the device object code must run in a different process. There are > +a couple of existing QEMU features that can run emulation code > +separately from the main QEMU process. These are examined below. > + > +vhost user model > +^^^^^^^^^^^^^^^^ > + > +Virtio guest device drivers can be connected to vhost user applications > +in order to perform their IO operations. 
This model uses special virtio > +device drivers in the guest and vhost user device objects in QEMU, but > +once the QEMU vhost user code has configured the vhost user application, > +mission-mode IO is performed by the application. The vhost user > +application is a daemon process that can be contacted via a known UNIX > +domain socket. > + > +vhost socket > +'''''''''''' > + > +As mentioned above, one of the tasks of the vhost device object within > +QEMU is to contact the vhost application and send it configuration > +information about this device instance. As part of the configuration > +process, the application can also be sent other file descriptors over > +the socket, which then can be used by the vhost user application in > +various ways, some of which are described below. > + > +vhost MMIO store acceleration > +''''''''''''''''''''''''''''' > + > +VMs are often run using HW virtualization features via the KVM kernel > +driver. This driver allows QEMU to accelerate the emulation of guest CPU > +instructions by running the guest in a virtual HW mode. When the guest > +executes instructions that cannot be executed by virtual HW mode, > +execution returns to the KVM driver so it can inform QEMU to emulate the > +instructions in SW. > + > +One of the events that can cause a return to QEMU is when a guest device > +driver accesses an IO location. QEMU then dispatches the memory > +operation to the corresponding QEMU device object. In the case of a > +vhost user device, the memory operation would need to be sent over a > +socket to the vhost application. This path is accelerated by the QEMU > +virtio code by setting up an eventfd file descriptor that the vhost > +application can directly receive MMIO store notifications from the KVM > +driver, instead of needing them to be sent to the QEMU process first. 
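For readers less familiar with the primitive behind this path, here is a minimal sketch of what an eventfd delivers to its reader. It is not QEMU or KVM code: it only uses the Linux `eventfd()` call, the two `write()` calls stand in for two guest store notifications, and the function name `demo_eventfd_counter` is invented for this example.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: signal an eventfd twice, then drain it.
 * Returns the counter value read back, or 0 on error.
 */
static uint64_t demo_eventfd_counter(void)
{
    uint64_t one = 1;
    uint64_t count = 0;
    int efd = eventfd(0, 0);

    if (efd < 0) {
        return 0;
    }
    /* Each write stands in for one store notification. */
    if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one) ||
        write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
        close(efd);
        return 0;
    }
    /* The read yields only the accumulated count and resets it to zero. */
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        count = 0;
    }
    close(efd);
    return count;
}
```

The reader learns that two events fired, but receives no address or store data with them, so any further detail has to come from somewhere other than the eventfd itself.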
> + > +vhost interrupt acceleration > +'''''''''''''''''''''''''''' > + > +Another optimization used by the vhost application is the ability to > +directly inject interrupts into the VM via the KVM driver, again, > +bypassing the need to send the interrupt back to the QEMU process first. > +The QEMU virtio setup code configures the KVM driver with an eventfd > +that triggers the device interrupt in the guest when the eventfd is > +written. This irqfd file descriptor is then passed to the vhost user > +application program. > + > +vhost access to guest memory > +'''''''''''''''''''''''''''' > + > +The vhost application is also allowed to directly access guest memory, > +instead of needing to send the data as messages to QEMU. This is also > +done with file descriptors sent to the vhost user application by QEMU. > +These descriptors can be passed to ``mmap()`` by the vhost application > +to map the guest address space into the vhost application. > + > +IOMMUs introduce another level of complexity, since the address given to > +the guest virtio device to DMA to or from is not a guest physical > +address. This case is handled by having vhost code within QEMU register > +as a listener for IOMMU mapping changes. The vhost application maintains > +a cache of IOMMMU translations: sending translation requests back to > +QEMU on cache misses, and in turn receiving flush requests from QEMU > +when mappings are purged. > + > +applicability to device separation > +'''''''''''''''''''''''''''''''''' > + > +Much of the vhost model can be re-used by separated device emulation. In > +particular, the ideas of using a socket between QEMU and the device > +emulation application, using a file descriptor to inject interrupts into > +the VM via KVM, and allowing the application to ``mmap()`` the guest > +should be re used. > + > +There are, however, some notable differences between how a vhost > +application works and the needs of separated device emulation. 
The most > +basic is that vhost uses custom virtio device drivers which always > +trigger IO with MMIO stores. A separated device emulation model must > +work with existing IO device models and guest device drivers. MMIO loads > +break vhost store acceleration since they are synchronous - guest > +progress cannot continue until the load has been emulated. By contrast, > +stores are asynchronous: the guest can continue after the store event > +has been sent to the vhost application. > + > +Another difference is that in the vhost user model, a single daemon can > +support multiple QEMU instances. This is contrary to the security regime > +desired, in which the emulation application should only be allowed to > +access the files or devices the VM it's running on behalf of can access. > + > +qemu-io model > +^^^^^^^^^^^^^ > + > +Qemu-io is a test harness used to test changes to the QEMU block backend > +object code (e.g., the code that implements disk images for disk driver > +emulation). Qemu-io is not a device emulation application per se, but it > +does compile the QEMU block objects into a separate binary from the main > +QEMU one. This could be useful for disk device emulation, since its > +emulation applications will need to include the QEMU block objects. > + > +New separation model based on proxy objects > +------------------------------------------- > + > +A different model based on proxy objects in the QEMU program > +communicating with remote emulation programs could provide separation > +while minimizing the changes needed to the device emulation code. The > +rest of this section is a discussion of how a proxy object model would > +work. > + > +Remote emulation processes > +~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +The remote emulation process will run the QEMU object hierarchy without > +modification.
The device emulation objects will also be based on the > +QEMU code, because for anything but the simplest device, it would not be > +tractable to re-implement both the object model and the many device > +backends that QEMU has. > + > +The processes will communicate with the QEMU process over UNIX domain > +sockets. The processes can be executed either as standalone processes, > +or be executed by QEMU. In both cases, the host backends the emulation > +processes will provide are specified on their command lines, as they would > +be for QEMU. For example: > + > +:: > + > + disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \ > + -blockdev driver=qcow2,node-name=drive0,file=file0 > + > +would indicate process *disk-proc* uses a qcow2 emulated disk named > +*file0* as its backend. > + > +Emulation processes may emulate more than one guest controller. A common > +configuration might be to put all controllers of the same device class > +(e.g., disk, network, etc.) in a single process, so that all backends of > +the same type can be managed by a single QMP monitor. > + > +communication with QEMU > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +Remote emulation processes will recognize a *-socket* argument that > +specifies the path of a UNIX domain socket used to communicate with the > +QEMU process. If no *-socket* argument is present, the process will use > +file descriptor 0 to communicate with QEMU. For example, > + > +:: > + > + disk-proc -socket /tmp/disk0-sock <backend list> > + > +will communicate with QEMU using the socket path */tmp/dik0-sock*.

s/dik/disk/

> + > +remote process QMP monitor > +^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Remote emulation processes can be monitored via QMP, similar to QEMU > +itself. The QMP monitor socket is specified the same as for a QEMU > +process: > + > +:: > + > + disk-proc -qmp unix:/tmp/disk-mon,server > + > +can be monitored over the UNIX socket path */tmp/disk-mon*.
> + > +QEMU command line > +~~~~~~~~~~~~~~~~~ > + > +The QEMU command line options will need to be modified to indicate which > +items are emulated by a separate program, and which remain emulated by > +QEMU itself. > + > +identifying remote emulation processes > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Remote emulation processes will be identified to QEMU using a *-remote* > +command line option. This option can either specify a command that QEMU > +will execute, or can specify a UNIX domain socket that QEMU can use to > +connect to an existing process. Both forms require a "id" option that > +identifies the process to later *-device* options. The process version > +is: > + > +:: > + > + -remote id=disk-proc,command="disk-proc <backend list>" > + > +And the socket version is: > + > +:: > + > + -remote id=disk-proc,socket="/tmp/disk0-sock" > + > +In the latter case, the remote process must be given the same socket on > +its command line when it is executed: > + > +:: > + > + disk-proc -socket /tmp/disk0-sock <backend list> > + > +identifying devices emulated remotely > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Devices that are to be emulated in a separate process will be identify s/be// > +the remote process with a "remote" option on their *-device* command > +line specification. e.g., an LSI SCSI controller and disk can be > +specified as: > + > +:: > + > + -device lsi53c895a,id=scsi0 > + -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0 > + > +If these devices are emulated by remote process "disk-proc," as > +described in the previous section, the QEMU command line would be: > + > +:: > + > + -device lsi53c895a,id=scsi0,remote=disk-proc > + -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0,remote=disk-proc The next patch documents rid=. This seems to be the same as remote=? Please use remote= everywhere. > + > +Some devices are implicitly created by the machine object. 
e.g., the q35 > +machine object will create its PCI bus, and attach an ich9-ahci IDE > +controller to it. In this case, options will need to be added to the > +*-machine* command line. e.g., > + > +:: > + > + -machine pc-q35,ide-remote=disk-proc > + > +will use the remote process with an "id" of "disk-proc" to emulate the > +IDE controller and its disks. It might be possible to avoid introducing special-purpose *-remote= parameters using the -set command-line option. If you know the id of the on-board device then you can set the remote= property on it: -set piix4-ide.ide0.remote=disk-proc I haven't tried this but if it works then no code changes are required. > +The disks themselves still need to be specified with *-remote* option, > +as in the example above. e.g., > + > +:: > + > + -device ide-hd,drive=drive0,bus=ide.0,unit=0,remote=disk-proc > + > +QEMU management of remote processes > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Each *-remote* instance on the QEMU command line will create a remote > +process proxy instance in QEMU. They will be held on a *QList* that can > +be searched for by its "id" property. The remote process proxy will also > +establish a communication channel between QEMU and the remote process. > +This can be done in one of two methods: direction execution of the > +process by QEMU with ``fork()`` and ``exec()`` system calls, or by > +connecting to an existing process. > + > +direct execution > +^^^^^^^^^^^^^^^^ > + > +When the remote process is directly executed, the remote process proxy > +will setup a communication channel between itself and the emulation > +process. This channel will be created using ``socketpair()`` and the > +remote process side of the pair will be given to the process as file > +descriptor 0. 
> + > +connecting to an existing process > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Some environments wish to deny QEMU the ability to execute ``fork()`` > +and ``exec()`` In these case, emulation processes will be started before > +QEMU, and a UNIX domain socket will be given to each emulation process > +to communicate with QEMU over. After communication is established, the > +socket will be unlinked from the file system space by the QEMU process. > + > +communication with emulation process > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +primary socket > +'''''''''''''' > + > +Whether the process was executed by QEMU or externally, there will be a > +primary socket for communication between QEMU and the remote process. > +This channel will handle configuration commands from QEMU to the > +process, either from the QEMU command line, or from QMP commands that > +affect the devices being emulated by the process. This channel will only > +allow one message to be pending at a time; if additional messages > +arrive, they must wait for previous ones to be acknowledged from the > +remote side. > + > +secondary sockets > +''''''''''''''''' > + > +The primary socket can pass the file descriptors of secondary sockets > +for operations that occur in parallel with commands on the primary > +channel. These include MMIO operations generated by the guest, interrupt > +notifications generated by the devices being emulated, or *vmstate* for > +live migration. These secondary sockets will be created at the behest of > +the device proxies that require them. A disk device proxy wouldn't need > +any secondary sockets, but a disk controller device proxy may need both > +an MMIO socket and an interrupt socket. > + > +emulation process attached via QMP command > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +There will be a new "attach-process" QMP command to facilitate device The QMP command name "remote-add" would be consistent with object-add. 
(There is also netdev_add and device_add but their names use underscores for legacy reasons.) > +hot-plug. This command's arguments will be the same as the *-remote* > +command line when it's used to attach to a remote process. i.e., it will > +need an "id" argument so that hot-plugged devices can later find it, and > +a "socket" argument to identify the UNIX domain socket that will be used > +to communicate with QEMU. > + > +QEMU device proxy objects > +~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +QEMU has an object model based on sub-classes inherited from the > +"object" super-class. The sub-classes that are of interest here are the > +"device" and "bus" sub-classes whose child sub-classes make up the > +device tree of a QEMU emulated system. > + > +The proxy object model will use device proxy objects to replace the > +device emulation code within the QEMU process. These objects will live > +in the same place in the object and bus hierarchies as the objects they > +replace. i.e., the proxy object for an LSI SCSI controller will be a > +sub-class of the "pci-device" class, and will have the same PCI bus > +parent and the same SCSI bus child objects as the LSI controller object > +it replaces. > + > +After the QEMU command line has been parsed, the remote devices will be > +instantiated in the same manner as local devices are. (i.e., > +``qdev_device_add()``). In order to distinguish them from regular > +*-device* device objects, their class name will be the name of the class > +it replaces, with "-proxy" appended. e.g., the "lsi53c895a" proxy class > +will be "lsi53c895a-proxy." Did you consider defining just -device pci-device-proxy,remote=ID and then transferring the device-specific details (e.g. PCI Configuration Space, BARs, and interrupt configuration) over the socket during initialization? That way it's not necessary to write proxy devices. There is just one PCI proxy device that automatically reflects the information from the device emulation process. 
> + > +device JSON description > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +The remote process needs a JSON representation of the command line > +options used to create the object. This JSON representation is used to > +create the corresponding object in the emulation process. e.g., for an > +LSI SCSI controller invoked as: > + > +:: > + > + -device lsi53c895a,id=scsi0,remote=lsi-scsi > + > +the proxy object would create a > + > +:: > + > + { "driver" : "lsi53c895a", "id" : "scsi0" } > + > +JSON description. The "driver" option is assigned to the device name > +when the command line is parsed, so the "-proxy" appended by the command > +line parsing code is removed. The "remote" option isn't needed in the > +JSON description since it only applies to the proxy object in the QEMU > +process. > + > +device object whitelist > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +Some device objects may not need a proxy. These are devices with no > +direct guest interfaces. (e.g., no MMIO, PIO, or interrupts). There will > +be a whitelist of such devices, and any devices on this list will not be > +instantiated in QEMU. Their JSON representation will still be sent to > +the remote process, so the object can be created there. > + > +object initialization > +^^^^^^^^^^^^^^^^^^^^^ > + > +QEMU object initialization occurs in two phases. The first > +initialization happens once per object class. (i.e., there can be many > +SCSI disks in an emulated system, but the "scsi-hd" class has its > +``class_init()`` function called only once) The second phase happens > +when each object's ``instance_init()`` function is called to initialize > +each instance of the object. > + > +All device objects are sub-classes of the "device" class, so they also > +have a ``realize()`` function that is called after ``instance_init()`` > +is called and after the object's static properties have been > +initialized. 
Many device objects don't even provide an instance\_init() > +function, and do all their per-instance work in ``realize()``. > + > +class\_init > +''''''''''' > + > +The ``class_init()`` method of a proxy object will, in general behave > +similarly to the object it replaces, including setting any static > +properties and methods needed by the proxy. > + > +instance\_init / realize > +'''''''''''''''''''''''' > + > +The ``instance_init()`` and ``realize()`` functions would only need to > +perform tasks related to being a proxy, such are registering its own > +MMIO handlers, or creating a child bus that other proxy devices can be > +attached to later. > + > +Other tasks will are device-specific. For example, PCI device objects > +will initialize the PCI config space in order to make a valid PCI device > +tree within the QEMU process. > + > +address space registration > +^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Most devices are driven by guest device driver accesses to IO addresses > +or ports. The QEMU device emulation code uses QEMU's memory region > +function calls (such as ``memory_region_init_io()``) to add callback > +functions that QEMU will invoke when the guest accesses the device's > +areas of the IO address space. When a guest driver does access the > +device, the VM will exit HW virtualization mode and return to QEMU, > +which will then lookup and execute the corresponding callback function. > + > +A proxy object would need to mirror the memory region calls the actual > +device emulator would perform in its initialization code, but with its > +own callbacks. When invoked by QEMU as a result of a guest IO operation, > +they will forward the operation to the device emulation process. > + > +PCI config space > +^^^^^^^^^^^^^^^^ > + > +PCI devices also have a configuration space that can be accessed by the > +guest driver. Guest accesses to this space is not handled by the device > +emulation object, but by its PCI parent object. 
Much of this space is > +read-only, but certain registers (especially BAR and MSI-related ones) > +need to be propagated to the emulation process. > + > +PCI parent proxy > +'''''''''''''''' > + > +One way to propagate guest PCI config accesses is to create a > +"pci-device-proxy" class that can serve as the parent of a PCI device > +proxy object. This class's parent would be "pci-device" and it would > +override the PCI parent's ``config_read()`` and ``config_write()`` > +methods with ones that forward these operations to the emulation > +program. > + > +interrupt receipt > +^^^^^^^^^^^^^^^^^ > + > +A proxy for a device that generates interrupts will need to create a > +socket to receive interrupt indications from the emulation process. An > +incoming interrupt indication would then be sent up to its bus parent to > +be injected into the guest. For example, a PCI device object may use > +``pci_set_irq()``. > + > +live migration > +^^^^^^^^^^^^^^ > + > +The proxy will register to save and restore any *vmstate* it needs over > +a live migration event. The device proxy does not need to manage the > +remote device's *vmstate*; that will be handled by the remote process > +proxy (see below). > + > +QEMU remote device operation > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Generic device operations, such as DMA, will be performs by the remote s/performs/performed/ > +process proxy by sending messages to the remote process. > + > +DMA operations > +^^^^^^^^^^^^^^ > + > +DMA operations would be handled much like vhost applications do. One of > +the initial messages sent to the emulation process is a guest memory > +table. Each entry in this table consists of a file descriptor and size > +that the emulation process can ``mmap()`` to directly access guest > +memory, similar to ``vhost_user_set_mem_table()``. Note guest memory > +must be backed by file descriptors, such as when QEMU is given the > +*-mem-path* command line option. 
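As a rough sketch of the guest memory table idea quoted above, the emulation process could ``mmap()`` each entry and then translate guest physical addresses into local pointers. The quoted text only names a file descriptor and a size per entry; the `gpa` and `offset` fields, the struct name, and both function names below are assumptions made for illustration and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical layout of one guest memory table entry. */
struct guest_mem_entry {
    uint64_t gpa;    /* guest physical base address (assumed field) */
    uint64_t size;   /* length of the region in bytes */
    off_t    offset; /* offset into the backing file (assumed field) */
    int      fd;     /* descriptor received from QEMU */
    void    *host;   /* local mapping, filled in by guest_mem_map() */
};

/* mmap() every entry; returns 0 on success, -1 on the first failure. */
static int guest_mem_map(struct guest_mem_entry *tbl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        void *p = mmap(NULL, tbl[i].size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, tbl[i].fd, tbl[i].offset);
        if (p == MAP_FAILED) {
            /* Undo the mappings made so far. */
            while (i-- > 0) {
                munmap(tbl[i].host, tbl[i].size);
                tbl[i].host = NULL;
            }
            return -1;
        }
        tbl[i].host = p;
    }
    return 0;
}

/* Translate a guest physical range to a local pointer, or NULL. */
static void *guest_mem_ptr(const struct guest_mem_entry *tbl, size_t n,
                           uint64_t gpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        /* The range must lie entirely inside one entry. */
        if (gpa >= tbl[i].gpa && len <= tbl[i].size &&
            gpa - tbl[i].gpa <= tbl[i].size - len) {
            return (uint8_t *)tbl[i].host + (gpa - tbl[i].gpa);
        }
    }
    return NULL;
}
```

Because the mappings are `MAP_SHARED`, stores made through the returned pointer are visible to every other process that mapped the same descriptor, which is what lets the emulation process DMA without a message round trip.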
> +
> +IOMMU operations
> +^^^^^^^^^^^^^^^^
> +
> +When the emulated system includes an IOMMU, the remote process proxy in
> +QEMU will need to create a socket for IOMMU requests from the emulation
> +process. It will handle those requests with an
> +``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
> +unmaps, the remote process proxy will also register as a listener on the
> +device's DMA address space. When an IOMMU memory region is created
> +within the DMA address space, an IOMMU notifier for unmaps will be added
> +to the memory region that will forward unmaps to the emulation process
> +over the IOMMU socket.
> +
> +device hot-plug via QMP
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +An QMP "device\_add" command can add a device emulated by a remote
> +process. It needs to add a "remote" option to the command, just as the
> +*-device* command line option does. The remote process may either be one

device_add parameters are parsed by the same code as -device. It shouldn't be necessary to add a "remote" option to device_add.

> +started at QEMU startup, or be one added by the "add-process" QMP
> +command described above. In either case, the remote process proxy will
> +forward the new device's JSON description to the corresponding emulation
> +process.
> +
> +live migration
> +^^^^^^^^^^^^^^
> +
> +The remote process proxy will also register for live migration
> +notifications with ``vmstate_register()``. When called to save state,
> +the proxy will send the remote process a secondary socket file
> +descriptor to save the remote process's device *vmstate* over. The
> +incoming byte stream length and data will be saved as the proxy's
> +*vmstate*. When the proxy is resumed on its new host, this *vmstate*
> +will be extracted, and a secondary socket file descriptor will be sent
> +to the new remote process through which it receives the *vmstate* in
> +order to restore the devices there.
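Handing a secondary socket file descriptor to the remote process, as the quoted live migration text describes, relies on descriptor passing over the UNIX domain socket. A minimal sketch using ``SCM_RIGHTS`` ancillary data follows; the function names `send_fd()` and `recv_fd()` and the one-byte `tag` used as a message type are hypothetical and are not part of the patch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one descriptor plus a 1-byte tag; returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd, char tag)
{
    union {
        struct cmsghdr hdr; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct iovec iov = { .iov_base = &tag, .iov_len = 1 };
    struct msghdr msg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; stores the tag in *tag. Returns the fd or -1. */
static int recv_fd(int sock, char *tag)
{
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct iovec iov = { .iov_base = tag, .iov_len = 1 };
    struct msghdr msg;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The received descriptor refers to the same open file description as the sender's, so the *vmstate* byte stream can flow over it while the primary socket stays free for other commands.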
> + > +device emulation in remote process > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +The parts of QEMU that the emulation program will need include the > +object model; the memory emulation objects; the device emulation objects > +of the targeted device, and any dependent devices; and, the device's > +backends. It will also need code to setup the machine environment, > +handle requests from the QEMU process, and route machine-level requests > +(such as interrupts or IOMMU mappings) back to the QEMU process. > + > +initialization > +'''''''''''''' > + > +The process initialization sequence will follow the same sequence > +followed by QEMU. It will first initialize the backend objects, then > +device emulation objects. The JSON descriptions sent by the QEMU process > +will drive which objects need to be created. > + > +- address spaces > + > +Before the device objects are created, the initial address spaces and > +memory regions must be configured with ``memory_map_init()``. This > +creates a RAM memory region object (*system\_memory*) and an IO memory > +region object (*system\_io*). > + > +- RAM > + > +RAM memory region creation will follow how ``pc_memory_init()`` creates > +them, but must use ``memory_region_init_ram_from_fd()`` instead of > +``memory_region_allocate_system_memory()``. The file descriptors needed > +will be supplied by the guest memory table from above. Those RAM regions > +would then be added to the *system\_memory* memory region with > +``memory_region_add_subregion()``. > + > +- PCI > + > +IO initialization will be driven by the JSON descriptions sent from the > +QEMU process. For a PCI device, a PCI bus will need to be created with > +``pci_root_bus_new()``, and a PCI memory region will need to be created > +and added to the *system\_memory* memory region with > +``memory_region_add_subregion_overlap()``. The overlap version is > +required for architectures where PCI memory overlaps with RAM memory. 
> + > +MMIO handling > +''''''''''''' > + > +The device emulation objects will use ``memory_region_init_io()`` to > +install their MMIO handlers, and ``pci_register_bar()`` to associate > +those handlers with a PCI BAR, as they do within QEMU currently. > + > +In order to use ``address_space_rw()`` in the emulation process to > +handle MMIO requests from QEMU, the PCI physical addresses must be the > +same in the QEMU process and the device emulation process. In order to > +accomplish that, guest BAR programming must also be forwarded from QEMU > +to the emulation process. > + > +interrupt injection > +''''''''''''''''''' > + > +When device emulation wants to inject an interrupt into the VM, the > +request climbs the device's bus object hierarchy until the point where a > +bus object knows how to signal the interrupt to the guest. The details > +depend on the type of interrupt being raised. > + > +- PCI pin interrupts > + > +On x86 systems, there is an emulated IOAPIC object attached to the root > +PCI bus object, and the root PCI object forwards interrupt requests to > +it. The IOAPIC object, in turn, calls the KVM driver to inject the > +corresponding interrupt into the VM. The simplest way to handle this in > +an emulation process would be to setup the root PCI bus driver (via > +``pci_bus_irqs()``) to send a interrupt request back to the QEMU > +process, and have the device proxy object reflect it up the PCI tree > +there. > + > +- PCI MSI/X interrupts > + > +PCI MSI/X interrupts are implemented in HW as DMA writes to a > +CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives > +these DMA writes, then calls into the KVM driver to inject the interrupt > +into the VM. A simple emulation process implementation would be to send > +the MSI DMA address from QEMU as a message at initialization, then > +install an address space handler at that address which forwards the MSI > +message back to QEMU. 
> +
> +DMA operations
> +''''''''''''''
> +
> +When a emulation object wants to DMA into or out of guest memory, it
> +first must use dma\_memory\_map() to convert the DMA address to a local
> +virtual address. The emulation process memory region objects setup above
> +will be used to translate the DMA address to a local virtual address the
> +device emulation code can access.
> +
> +IOMMU
> +'''''
> +
> +When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
> +regions to translate the DMA address to a guest physical address before
> +that physical address can be translated to a local virtual address. The
> +emulation process will need similar functionality.
> +
> +- IOTLB cache
> +
> +The emulation process will maintain a cache of recent IOMMU translations
> +(the IOTLB). When the translate() callback of an IOMMU memory region is
> +invoked, the IOTLB cache will be searched for an entry that will map the
> +DMA address to a guest PA. On a cache miss, a message will be sent back
> +to QEMU requesting the corresponding translation entry, which be both be
> +used to return a guest address and be added to the cache.
> +
> +- IOTLB purge
> +
> +The IOMMU emulation will also need to act on unmap requests from QEMU.
> +These happen when the guest IOMMU driver purges an entry from the
> +guest's translation table.
> +
> +live migration
> +''''''''''''''
> +
> +When a remote process receives a live migration indication from QEMU, it
> +will set up a channel using the received file descriptor with
> +``qio_channel_socket_new_fd()``. This channel will be used to create a
> +*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
> +the process's device state back to QEMU. This method will be reversed on
> +restore - the channel will be passed to ``qemu_loadvm_state()`` to
> +restore the device state.
> +

I have reviewed up to here... :)
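For reference, the IOTLB cache and purge behaviour described in the quoted IOMMU section might be sketched as below. This is an illustrative direct-mapped cache and not code from the series: the type and function names, the 4 KiB page size, and the slot count are all assumptions, and `demo_miss()` merely stands in for the message round trip to QEMU on a cache miss.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOTLB_PAGE_SHIFT 12 /* assumed 4 KiB translation granule */
#define IOTLB_SLOTS      64 /* assumed cache size */

struct iotlb_entry {
    bool     valid;
    uint64_t iova_pfn; /* DMA (IO virtual) page number */
    uint64_t gpa_pfn;  /* guest physical page number */
};

/* Miss handler: stand-in for asking QEMU for a translation entry. */
typedef bool (*iotlb_miss_fn)(uint64_t iova_pfn, uint64_t *gpa_pfn,
                              void *opaque);

struct iotlb {
    struct iotlb_entry slot[IOTLB_SLOTS];
    iotlb_miss_fn miss;
    void *opaque;
};

/* Translate a DMA address to a guest PA, filling the cache on a miss. */
static bool iotlb_translate(struct iotlb *t, uint64_t iova, uint64_t *gpa)
{
    uint64_t pfn = iova >> IOTLB_PAGE_SHIFT;
    struct iotlb_entry *e = &t->slot[pfn % IOTLB_SLOTS];

    if (!e->valid || e->iova_pfn != pfn) {
        uint64_t gpa_pfn;

        if (t->miss == NULL || !t->miss(pfn, &gpa_pfn, t->opaque)) {
            return false;
        }
        e->valid = true;
        e->iova_pfn = pfn;
        e->gpa_pfn = gpa_pfn;
    }
    *gpa = (e->gpa_pfn << IOTLB_PAGE_SHIFT) |
           (iova & ((1ULL << IOTLB_PAGE_SHIFT) - 1));
    return true;
}

/* Drop every cached entry overlapping [iova, iova + len). */
static void iotlb_purge(struct iotlb *t, uint64_t iova, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = iova >> IOTLB_PAGE_SHIFT;
    uint64_t last = (iova + len - 1) >> IOTLB_PAGE_SHIFT;

    for (size_t i = 0; i < IOTLB_SLOTS; i++) {
        struct iotlb_entry *e = &t->slot[i];
        if (e->valid && e->iova_pfn >= first && e->iova_pfn <= last) {
            e->valid = false;
        }
    }
}

/* Example miss handler: counts calls and maps page N to page N + 0x100. */
static bool demo_miss(uint64_t iova_pfn, uint64_t *gpa_pfn, void *opaque)
{
    int *count = opaque;

    (*count)++;
    *gpa_pfn = iova_pfn + 0x100;
    return true;
}
```

A direct-mapped table keeps lookup constant-time at the cost of conflict evictions; a real implementation would also have to handle larger page sizes and access permissions, which this sketch ignores.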
On Thu, Oct 24, 2019 at 05:09:29AM -0400, Jagannathan Raman wrote:
> +Accelerating device emulation
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The messages that are required to be sent between QEMU and the emulation
> +process can add considerable latency to IO operations. The optimizations
> +described below attempt to ameliorate this effect by allowing the
> +emulation process to communicate directly with the kernel KVM driver.
> +The KVM file descriptors created wold be passed to the emulation process

s/wold/would/

I skipped the acceleration section for now because they require kvm.ko changes. I'll focus the remainder of the review on the patches as they are now.
diff --git a/docs/devel/index.rst b/docs/devel/index.rst index 1ec61fc..edd3fe3 100644 --- a/docs/devel/index.rst +++ b/docs/devel/index.rst @@ -22,3 +22,4 @@ Contents: decodetree secure-coding-practices tcg + multi-process diff --git a/docs/devel/qemu-multiprocess.rst b/docs/devel/qemu-multiprocess.rst new file mode 100644 index 0000000..2c42c6e --- /dev/null +++ b/docs/devel/qemu-multiprocess.rst @@ -0,0 +1,1102 @@ +Disaggregating QEMU +=================== + +QEMU is often used as the hypervisor for virtual machines running in the +Oracle cloud. Since one of the advantages of cloud computing is the +ability to run many VMs from different tenants in the same cloud +infrastructure, a guest that compromised its hypervisor could +potentially use the hypervisor's access privileges to access data it is +not authorized for. + +QEMU can be susceptible to security attack because it is a large, +monolithic program that provides many features to the VMs it services. +Many of these feature can be configured out of QEMU, but even a reduced +configuration QEMU has a large amount of code a guest can potentially +attack in order to gain additional privileges. + +QEMU services +------------- + +QEMU can be broadly described as providing three main services. One is a +VM control point, where VMs can be created, migrated, re-configured, and +destroyed. A second is to emulate the CPU instructions within the VM, +often accelerated by HW virtualization features such as Intel's VT +extensions. Finally, it provides IO services to the VM by emulating HW +IO devices, such as disk and network devices. + +A disaggregated QEMU +~~~~~~~~~~~~~~~~~~~~ + +A disaggregated QEMU involves separating QEMU services into separate +host processes. Each of these processes can be given only the privileges +it needs to provide its service, e.g., a disk service could be given +access only the the disk images it provides, and not be allowed to +access other files, or any network devices. 
An attacker who compromised +this service would not be able to use this exploit to access files or +devices beyond what the disk service was given access to. + +A QEMU control process would remain, but in disaggregated mode, it would +be a control point that executes the processes needed to support the VM +being created, but have no direct interfaces to the VM. During VM +execution, it would still provide the user interface to hot-plug devices +or live migrate the VM. + +A first step in creating a disaggregated QEMU is to separate IO services +from the main QEMU program, which would continue to provide CPU +emulation. i.e., the control process would also be the CPU emulation +process. In a later phase, CPU emulation could be separated from the +control process. + +Disaggregating IO services +-------------------------- + +Disaggregating IO services is a good place to begin QEMU disaggregating +for a couple of reasons. One is the sheer number of IO devices QEMU can +emulate provides a large surface of interfaces which could potentially +be exploited, and, indeed, have been a source of exploits in the past. +Another is the modular nature of QEMU device emulation code provides +interface points where the QEMU functions that perform device emulation +can be separated from the QEMU functions that manage the emulation of +guest CPU instructions. + +QEMU device emulation +~~~~~~~~~~~~~~~~~~~~~ + +QEMU uses a object oriented SW architecture for device emulation code. +Configured objects are all compiled into the QEMU binary, then objects +are instantiated by name when used by the guest VM. For example, the +code to emulate a device named "foo" is always present in QEMU, but its +instantiation code is only run when the device is included in the target +VM. 
(e.g., via the QEMU command line as *-device foo*) + +The object model is hierarchical, so device emulation code names its +parent object (such as "pci-device" for a PCI device) and QEMU will +instantiate a parent object before calling the device's instantiation +code. + +Current separation models +~~~~~~~~~~~~~~~~~~~~~~~~~ + +In order to separate the device emulation code from the CPU emulation +code, the device object code must run in a different process. There are +a couple of existing QEMU features that can run emulation code +separately from the main QEMU process. These are examined below. + +vhost user model +^^^^^^^^^^^^^^^^ + +Virtio guest device drivers can be connected to vhost user applications +in order to perform their IO operations. This model uses special virtio +device drivers in the guest and vhost user device objects in QEMU, but +once the QEMU vhost user code has configured the vhost user application, +mission-mode IO is performed by the application. The vhost user +application is a daemon process that can be contacted via a known UNIX +domain socket. + +vhost socket +'''''''''''' + +As mentioned above, one of the tasks of the vhost device object within +QEMU is to contact the vhost application and send it configuration +information about this device instance. As part of the configuration +process, the application can also be sent other file descriptors over +the socket, which then can be used by the vhost user application in +various ways, some of which are described below. + +vhost MMIO store acceleration +''''''''''''''''''''''''''''' + +VMs are often run using HW virtualization features via the KVM kernel +driver. This driver allows QEMU to accelerate the emulation of guest CPU +instructions by running the guest in a virtual HW mode. When the guest +executes instructions that cannot be executed by virtual HW mode, +execution returns to the KVM driver so it can inform QEMU to emulate the +instructions in SW. 
+ +One of the events that can cause a return to QEMU is when a guest device +driver accesses an IO location. QEMU then dispatches the memory +operation to the corresponding QEMU device object. In the case of a +vhost user device, the memory operation would need to be sent over a +socket to the vhost application. This path is accelerated by the QEMU +virtio code by setting up an eventfd file descriptor that the vhost +application can directly receive MMIO store notifications from the KVM +driver, instead of needing them to be sent to the QEMU process first. + +vhost interrupt acceleration +'''''''''''''''''''''''''''' + +Another optimization used by the vhost application is the ability to +directly inject interrupts into the VM via the KVM driver, again, +bypassing the need to send the interrupt back to the QEMU process first. +The QEMU virtio setup code configures the KVM driver with an eventfd +that triggers the device interrupt in the guest when the eventfd is +written. This irqfd file descriptor is then passed to the vhost user +application program. + +vhost access to guest memory +'''''''''''''''''''''''''''' + +The vhost application is also allowed to directly access guest memory, +instead of needing to send the data as messages to QEMU. This is also +done with file descriptors sent to the vhost user application by QEMU. +These descriptors can be passed to ``mmap()`` by the vhost application +to map the guest address space into the vhost application. + +IOMMUs introduce another level of complexity, since the address given to +the guest virtio device to DMA to or from is not a guest physical +address. This case is handled by having vhost code within QEMU register +as a listener for IOMMU mapping changes. The vhost application maintains +a cache of IOMMMU translations: sending translation requests back to +QEMU on cache misses, and in turn receiving flush requests from QEMU +when mappings are purged. 
+ +applicability to device separation +'''''''''''''''''''''''''''''''''' + +Much of the vhost model can be re-used by separated device emulation. In +particular, the ideas of using a socket between QEMU and the device +emulation application, using a file descriptor to inject interrupts into +the VM via KVM, and allowing the application to ``mmap()`` the guest +should be re used. + +There are, however, some notable differences between how a vhost +application works and the needs of separated device emulation. The most +basic is that vhost uses custom virtio device drivers which always +trigger IO with MMIO stores. A separated device emulation model must +work with existing IO device models and guest device drivers. MMIO loads +break vhost store acceleration since they are synchronous - guest +progress cannot continue until the load has been emulated. By contrast, +stores are asynchronous, the guest can continue after the store event +has been sent to the vhost application. + +Another difference is that in the vhost user model, a single daemon can +support multiple QEMU instances. This is contrary to the security regime +desired, in which the emulation application should only be allowed to +access the files or devices the VM it's running on behalf of can access. +#### qemu-io model + +Qemu-io is a test harness used to test changes to the QEMU block backend +object code. (e.g., the code that implements disk images for disk driver +emulation) Qemu-io is not a device emulation application per se, but it +does compile the QEMU block objects into a separate binary from the main +QEMU one. This could be useful for disk device emulation, since its +emulation applications will need to include the QEMU block objects. 
+ +New separation model based on proxy objects +------------------------------------------- + +A different model based on proxy objects in the QEMU program +communicating with remote emulation programs could provide separation +while minimizing the changes needed to the device emulation code. The +rest of this section is a discussion of how a proxy object model would +work. + +Remote emulation processes +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The remote emulation process will run the QEMU object hierarchy without +modification. The device emulation objects will be also be based on the +QEMU code, because for anything but the simplest device, it would not be +a tractable to re-implement both the object model and the many device +backends that QEMU has. + +The processes will communicate with the QEMU process over UNIX domain +sockets. The processes can be executed either as standalone processes, +or be executed by QEMU. In both cases, the host backends the emulation +processes will provide are specified on its command line, as they would +be for QEMU. For example: + +:: + + disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \ + -blockdev driver=qcow2,node-name=drive0,file=file0 + +would indicate process *disk-proc* uses a qcow2 emulated disk named +*file0* as its backend. + +Emulation processes may emulate more than one guest controller. A common +configuration might be to put all controllers of the same device class +(e.g., disk, network, etc.) in a single process, so that all backends of +the same type can be managed by a single QMP monitor. + +communication with QEMU +^^^^^^^^^^^^^^^^^^^^^^^ + +Remote emulation processes will recognize a *-socket* argument that +specifies the path of a UNIX domain socket used to communicate with the +QEMU process. If no *-socket* argument is present, the process will use +file descriptor 0 to communicate with QEMU. 
For example, + +:: + + disk-proc -socket /tmp/disk0-sock <backend list> + +will communicate with QEMU using the socket path */tmp/dik0-sock*. + +remote process QMP monitor +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Remote emulation processes can be monitored via QMP, similar to QEMU +itself. The QMP monitor socket is specified the same as for a QEMU +process: + +:: + + disk-proc -qmp unix:/tmp/disk-mon,server + +can be monitored over the UNIX socket path */tmp/disk-mon*. + +QEMU command line +~~~~~~~~~~~~~~~~~ + +The QEMU command line options will need to be modified to indicate which +items are emulated by a separate program, and which remain emulated by +QEMU itself. + +identifying remote emulation processes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Remote emulation processes will be identified to QEMU using a *-remote* +command line option. This option can either specify a command that QEMU +will execute, or can specify a UNIX domain socket that QEMU can use to +connect to an existing process. Both forms require a "id" option that +identifies the process to later *-device* options. The process version +is: + +:: + + -remote id=disk-proc,command="disk-proc <backend list>" + +And the socket version is: + +:: + + -remote id=disk-proc,socket="/tmp/disk0-sock" + +In the latter case, the remote process must be given the same socket on +its command line when it is executed: + +:: + + disk-proc -socket /tmp/disk0-sock <backend list> + +identifying devices emulated remotely +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Devices that are to be emulated in a separate process will be identify +the remote process with a "remote" option on their *-device* command +line specification. 
e.g., an LSI SCSI controller and disk can be +specified as: + +:: + + -device lsi53c895a,id=scsi0 + -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0 + +If these devices are emulated by remote process "disk-proc," as +described in the previous section, the QEMU command line would be: + +:: + + -device lsi53c895a,id=scsi0,remote=disk-proc + -device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0,remote=disk-proc + +Some devices are implicitly created by the machine object. e.g., the q35 +machine object will create its PCI bus, and attach an ich9-ahci IDE +controller to it. In this case, options will need to be added to the +*-machine* command line. e.g., + +:: + + -machine pc-q35,ide-remote=disk-proc + +will use the remote process with an "id" of "disk-proc" to emulate the +IDE controller and its disks. + +The disks themselves still need to be specified with the "remote" option, +as in the example above. e.g., + +:: + + -device ide-hd,drive=drive0,bus=ide.0,unit=0,remote=disk-proc + +QEMU management of remote processes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each *-remote* instance on the QEMU command line will create a remote +process proxy instance in QEMU. These instances will be held on a *QList* that can +be searched by "id" property. The remote process proxy will also +establish a communication channel between QEMU and the remote process. +This can be done in one of two ways: direct execution of the +process by QEMU with ``fork()`` and ``exec()`` system calls, or by +connecting to an existing process. + +direct execution +^^^^^^^^^^^^^^^^ + +When the remote process is directly executed, the remote process proxy +will set up a communication channel between itself and the emulation +process. This channel will be created using ``socketpair()`` and the +remote process side of the pair will be given to the process as file +descriptor 0. 
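The direct execution path described above can be sketched in a few lines of Python. This is only an illustration of the ``socketpair()`` / ``fork()`` / ``exec()`` pattern, not the proposed QEMU code: the helper names ``spawn_remote`` and ``demo`` are hypothetical, and the child here is a trivial stand-in program that echoes one message over descriptor 0 rather than a real emulation process.

```python
import os
import socket
import sys

# Hypothetical stand-in for an emulation program: it treats file
# descriptor 0 as its channel to the parent and acknowledges one message.
CHILD_CODE = (
    "import socket\n"
    "chan = socket.socket(fileno=0)\n"
    "msg = chan.recv(64)\n"
    "chan.sendall(b'ack:' + msg)\n"
)


def spawn_remote(argv):
    """Fork and exec argv with one end of a socketpair installed as fd 0."""
    parent_end, child_end = socket.socketpair(socket.AF_UNIX,
                                              socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: drop the parent's end and move our end to descriptor 0.
        try:
            parent_end.close()
            os.dup2(child_end.fileno(), 0)
            os.execv(argv[0], argv)
        finally:
            # Only reached if the exec failed.
            os._exit(127)
    # Parent: keep only its own end of the pair.
    child_end.close()
    return pid, parent_end


def demo():
    """Start the stand-in child, exchange one message, and reap it."""
    pid, chan = spawn_remote([sys.executable, "-c", CHILD_CODE])
    chan.sendall(b"hello")
    reply = chan.recv(64)
    chan.close()
    os.waitpid(pid, 0)
    return reply
```

Because the child inherits its end of the pair as descriptor 0, no socket path ever appears in the file system in this mode.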
+ +connecting to an existing process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Some environments wish to deny QEMU the ability to execute ``fork()`` +and ``exec()``. In these cases, emulation processes will be started before +QEMU, and a UNIX domain socket will be given to each emulation process +to communicate with QEMU over. After communication is established, the +socket will be unlinked from the file system space by the QEMU process. + +communication with emulation process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +primary socket +'''''''''''''' + +Whether the process was executed by QEMU or externally, there will be a +primary socket for communication between QEMU and the remote process. +This channel will handle configuration commands from QEMU to the +process, either from the QEMU command line, or from QMP commands that +affect the devices being emulated by the process. This channel will only +allow one message to be pending at a time; if additional messages +arrive, they must wait for previous ones to be acknowledged from the +remote side. + +secondary sockets +''''''''''''''''' + +The primary socket can pass the file descriptors of secondary sockets +for operations that occur in parallel with commands on the primary +channel. These include MMIO operations generated by the guest, interrupt +notifications generated by the devices being emulated, or *vmstate* for +live migration. These secondary sockets will be created at the behest of +the device proxies that require them. A disk device proxy wouldn't need +any secondary sockets, but a disk controller device proxy may need both +an MMIO socket and an interrupt socket. + +emulation process attached via QMP command +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There will be a new "attach-process" QMP command to facilitate device +hot-plug. This command's arguments will be the same as the *-remote* +command line when it's used to attach to a remote process. 
i.e., it will +need an "id" argument so that hot-plugged devices can later find it, and +a "socket" argument to identify the UNIX domain socket that will be used +to communicate with QEMU. + +QEMU device proxy objects +~~~~~~~~~~~~~~~~~~~~~~~~~ + +QEMU has an object model based on sub-classes inherited from the +"object" super-class. The sub-classes that are of interest here are the +"device" and "bus" sub-classes whose child sub-classes make up the +device tree of a QEMU emulated system. + +The proxy object model will use device proxy objects to replace the +device emulation code within the QEMU process. These objects will live +in the same place in the object and bus hierarchies as the objects they +replace. i.e., the proxy object for an LSI SCSI controller will be a +sub-class of the "pci-device" class, and will have the same PCI bus +parent and the same SCSI bus child objects as the LSI controller object +it replaces. + +After the QEMU command line has been parsed, the remote devices will be +instantiated in the same manner as local devices are. (i.e., +``qdev_device_add()``). In order to distinguish them from regular +*-device* device objects, their class name will be the name of the class +it replaces, with "-proxy" appended. e.g., the "lsi53c895a" proxy class +will be "lsi53c895a-proxy." + +device JSON description +^^^^^^^^^^^^^^^^^^^^^^^ + +The remote process needs a JSON representation of the command line +options used to create the object. This JSON representation is used to +create the corresponding object in the emulation process. e.g., for an +LSI SCSI controller invoked as: + +:: + + -device lsi53c895a,id=scsi0,remote=lsi-scsi + +the proxy object would create a + +:: + + { "driver" : "lsi53c895a", "id" : "scsi0" } + +JSON description. The "driver" option is assigned to the device name +when the command line is parsed, so the "-proxy" appended by the command +line parsing code is removed. 
The "remote" option isn't needed in the +JSON description since it only applies to the proxy object in the QEMU +process. + +device object whitelist +^^^^^^^^^^^^^^^^^^^^^^^ + +Some device objects may not need a proxy. These are devices with no +direct guest interfaces (e.g., no MMIO, PIO, or interrupts). There will +be a whitelist of such devices, and any devices on this list will not be +instantiated in QEMU. Their JSON representation will still be sent to +the remote process, so the object can be created there. + +object initialization +^^^^^^^^^^^^^^^^^^^^^ + +QEMU object initialization occurs in two phases. The first +initialization happens once per object class (i.e., there can be many +SCSI disks in an emulated system, but the "scsi-hd" class has its +``class_init()`` function called only once). The second phase happens +when each object's ``instance_init()`` function is called to initialize +each instance of the object. + +All device objects are sub-classes of the "device" class, so they also +have a ``realize()`` function that is called after ``instance_init()`` +is called and after the object's static properties have been +initialized. Many device objects don't even provide an ``instance_init()`` +function, and do all their per-instance work in ``realize()``. + +class\_init +''''''''''' + +The ``class_init()`` method of a proxy object will, in general, behave +similarly to the object it replaces, including setting any static +properties and methods needed by the proxy. + +instance\_init / realize +'''''''''''''''''''''''' + +The ``instance_init()`` and ``realize()`` functions would only need to +perform tasks related to being a proxy, such as registering its own +MMIO handlers, or creating a child bus that other proxy devices can be +attached to later. + +Other tasks are device-specific. For example, PCI device objects +will initialize the PCI config space in order to make a valid PCI device +tree within the QEMU process. 
+ +address space registration +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Most devices are driven by guest device driver accesses to IO addresses +or ports. The QEMU device emulation code uses QEMU's memory region +function calls (such as ``memory_region_init_io()``) to add callback +functions that QEMU will invoke when the guest accesses the device's +areas of the IO address space. When a guest driver does access the +device, the VM will exit HW virtualization mode and return to QEMU, +which will then look up and execute the corresponding callback function. + +A proxy object would need to mirror the memory region calls the actual +device emulator would perform in its initialization code, but with its +own callbacks. When invoked by QEMU as a result of a guest IO operation, +they will forward the operation to the device emulation process. + +PCI config space +^^^^^^^^^^^^^^^^ + +PCI devices also have a configuration space that can be accessed by the +guest driver. Guest accesses to this space are not handled by the device +emulation object, but by its PCI parent object. Much of this space is +read-only, but certain registers (especially BAR and MSI-related ones) +need to be propagated to the emulation process. + +PCI parent proxy +'''''''''''''''' + +One way to propagate guest PCI config accesses is to create a +"pci-device-proxy" class that can serve as the parent of a PCI device +proxy object. This class's parent would be "pci-device" and it would +override the PCI parent's ``config_read()`` and ``config_write()`` +methods with ones that forward these operations to the emulation +program. + +interrupt receipt +^^^^^^^^^^^^^^^^^ + +A proxy for a device that generates interrupts will need to create a +socket to receive interrupt indications from the emulation process. An +incoming interrupt indication would then be sent up to its bus parent to +be injected into the guest. For example, a PCI device object may use +``pci_set_irq()``. 
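The interrupt receipt path above could look roughly like the following Python sketch. The wire format (two 32-bit integers carrying a pin number and a level) and the names ``raise_irq`` and ``handle_irq`` are assumptions made for illustration; the document does not define the message layout. The ``set_irq`` callback stands in for whatever the bus parent provides, such as ``pci_set_irq()`` for a PCI device.

```python
import socket
import struct

# Assumed (hypothetical) wire format: interrupt pin and level, each an
# unsigned 32-bit integer. Level 1 asserts the pin, level 0 de-asserts it.
IRQ_MSG = struct.Struct("=II")


def raise_irq(sock, pin, level):
    """Emulation process side: send one interrupt indication."""
    sock.sendall(IRQ_MSG.pack(pin, level))


def handle_irq(sock, set_irq):
    """Proxy side: read one indication and pass it up to the bus parent."""
    data = sock.recv(IRQ_MSG.size, socket.MSG_WAITALL)
    if len(data) != IRQ_MSG.size:
        raise ConnectionError("interrupt socket closed")
    pin, level = IRQ_MSG.unpack(data)
    set_irq(pin, level)
```

In a real proxy the receiving side would be driven by the main loop's file descriptor handlers rather than called directly.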
+ +live migration +^^^^^^^^^^^^^^ + +The proxy will register to save and restore any *vmstate* it needs over +a live migration event. The device proxy does not need to manage the +remote device's *vmstate*; that will be handled by the remote process +proxy (see below). + +QEMU remote device operation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Generic device operations, such as DMA, will be performed by the remote +process proxy by sending messages to the remote process. + +DMA operations +^^^^^^^^^^^^^^ + +DMA operations would be handled much like vhost applications do. One of +the initial messages sent to the emulation process is a guest memory +table. Each entry in this table consists of a file descriptor and size +that the emulation process can ``mmap()`` to directly access guest +memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory +must be backed by file descriptors, such as when QEMU is given the +*-mem-path* command line option. + +IOMMU operations +^^^^^^^^^^^^^^^^ + +When the emulated system includes an IOMMU, the remote process proxy in +QEMU will need to create a socket for IOMMU requests from the emulation +process. It will handle those requests with an +``address_space_get_iotlb_entry()`` call. In order to handle IOMMU +unmaps, the remote process proxy will also register as a listener on the +device's DMA address space. When an IOMMU memory region is created +within the DMA address space, an IOMMU notifier for unmaps will be added +to the memory region that will forward unmaps to the emulation process +over the IOMMU socket. + +device hot-plug via QMP +^^^^^^^^^^^^^^^^^^^^^^^ + +A QMP "device\_add" command can add a device emulated by a remote +process. It needs to add a "remote" option to the command, just as the +*-device* command line option does. The remote process may either be one +started at QEMU startup, or be one added by the "attach-process" QMP +command described above. 
In either case, the remote process proxy will +forward the new device's JSON description to the corresponding emulation +process. + +live migration +^^^^^^^^^^^^^^ + +The remote process proxy will also register for live migration +notifications with ``vmstate_register()``. When called to save state, +the proxy will send the remote process a secondary socket file +descriptor to save the remote process's device *vmstate* over. The +incoming byte stream length and data will be saved as the proxy's +*vmstate*. When the proxy is resumed on its new host, this *vmstate* +will be extracted, and a secondary socket file descriptor will be sent +to the new remote process through which it receives the *vmstate* in +order to restore the devices there. + +device emulation in remote process +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The parts of QEMU that the emulation program will need include the +object model; the memory emulation objects; the device emulation objects +of the targeted device, and any dependent devices; and the device's +backends. It will also need code to set up the machine environment, +handle requests from the QEMU process, and route machine-level requests +(such as interrupts or IOMMU mappings) back to the QEMU process. + +initialization +'''''''''''''' + +The process initialization sequence will follow the same sequence +followed by QEMU. It will first initialize the backend objects, then +device emulation objects. The JSON descriptions sent by the QEMU process +will drive which objects need to be created. + +- address spaces + +Before the device objects are created, the initial address spaces and +memory regions must be configured with ``memory_map_init()``. This +creates a RAM memory region object (*system\_memory*) and an IO memory +region object (*system\_io*). + +- RAM + +RAM memory region creation will follow how ``pc_memory_init()`` creates +them, but must use ``memory_region_init_ram_from_fd()`` instead of +``memory_region_allocate_system_memory()``. 
The file descriptors needed +will be supplied by the guest memory table from above. Those RAM regions +would then be added to the *system\_memory* memory region with +``memory_region_add_subregion()``. + +- PCI + +IO initialization will be driven by the JSON descriptions sent from the +QEMU process. For a PCI device, a PCI bus will need to be created with +``pci_root_bus_new()``, and a PCI memory region will need to be created +and added to the *system\_memory* memory region with +``memory_region_add_subregion_overlap()``. The overlap version is +required for architectures where PCI memory overlaps with RAM memory. + +MMIO handling +''''''''''''' + +The device emulation objects will use ``memory_region_init_io()`` to +install their MMIO handlers, and ``pci_register_bar()`` to associate +those handlers with a PCI BAR, as they do within QEMU currently. + +In order to use ``address_space_rw()`` in the emulation process to +handle MMIO requests from QEMU, the PCI physical addresses must be the +same in the QEMU process and the device emulation process. In order to +accomplish that, guest BAR programming must also be forwarded from QEMU +to the emulation process. + +interrupt injection +''''''''''''''''''' + +When device emulation wants to inject an interrupt into the VM, the +request climbs the device's bus object hierarchy until the point where a +bus object knows how to signal the interrupt to the guest. The details +depend on the type of interrupt being raised. + +- PCI pin interrupts + +On x86 systems, there is an emulated IOAPIC object attached to the root +PCI bus object, and the root PCI object forwards interrupt requests to +it. The IOAPIC object, in turn, calls the KVM driver to inject the +corresponding interrupt into the VM. 
The simplest way to handle this in +an emulation process would be to set up the root PCI bus driver (via +``pci_bus_irqs()``) to send an interrupt request back to the QEMU +process, and have the device proxy object reflect it up the PCI tree +there. + +- PCI MSI/X interrupts + +PCI MSI/X interrupts are implemented in HW as DMA writes to a +CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives +these DMA writes, then calls into the KVM driver to inject the interrupt +into the VM. A simple emulation process implementation would be to send +the MSI DMA address from QEMU as a message at initialization, then +install an address space handler at that address which forwards the MSI +message back to QEMU. + +DMA operations +'''''''''''''' + +When an emulation object wants to DMA into or out of guest memory, it +first must use ``dma_memory_map()`` to convert the DMA address to a local +virtual address. The emulation process memory region objects set up above +will be used to translate the DMA address to a local virtual address the +device emulation code can access. + +IOMMU +''''' + +When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory +regions to translate the DMA address to a guest physical address before +that physical address can be translated to a local virtual address. The +emulation process will need similar functionality. + +- IOTLB cache + +The emulation process will maintain a cache of recent IOMMU translations +(the IOTLB). When the translate() callback of an IOMMU memory region is +invoked, the IOTLB cache will be searched for an entry that will map the +DMA address to a guest PA. On a cache miss, a message will be sent back +to QEMU requesting the corresponding translation entry, which will both be +used to return a guest address and be added to the cache. + +- IOTLB purge + +The IOMMU emulation will also need to act on unmap requests from QEMU. +These happen when the guest IOMMU driver purges an entry from the +guest's translation table. 
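The IOTLB cache and purge behaviour described above can be modeled with a small Python sketch. It is a simplification under stated assumptions: translations are cached per fixed-size page (real IOTLB entries carry their own address masks), ``page_size`` is a power of two, and ``request_translation`` is a hypothetical callback standing in for the message exchange with QEMU.

```python
class IOTLBCache:
    """Toy model of the emulation process's IOTLB (fixed page size assumed)."""

    def __init__(self, request_translation, page_size=4096):
        # request_translation(page) is a hypothetical stand-in for the
        # round trip to QEMU; it returns the guest PA of a DMA page.
        self._request = request_translation
        self._page_size = page_size
        self._entries = {}

    def _page_of(self, addr):
        return addr & ~(self._page_size - 1)

    def translate(self, dma_addr):
        """Map a DMA address to a guest PA, asking QEMU only on a miss."""
        page = self._page_of(dma_addr)
        if page not in self._entries:
            self._entries[page] = self._request(page)
        return self._entries[page] + (dma_addr - page)

    def unmap(self, dma_addr, length):
        """Purge every cached page overlapping [dma_addr, dma_addr + length)."""
        first = self._page_of(dma_addr)
        last = self._page_of(dma_addr + length - 1)
        for page in range(first, last + self._page_size, self._page_size):
            self._entries.pop(page, None)
```

After an ``unmap()``, the next access to the purged page misses again and triggers a fresh request, which is the behaviour the purge path needs to guarantee.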
+ +live migration +'''''''''''''' + +When a remote process receives a live migration indication from QEMU, it +will set up a channel using the received file descriptor with +``qio_channel_socket_new_fd()``. This channel will be used to create a +*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send +the process's device state back to QEMU. This method will be reversed on +restore - the channel will be passed to ``qemu_loadvm_state()`` to +restore the device state. + +Accelerating device emulation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The messages that are required to be sent between QEMU and the emulation +process can add considerable latency to IO operations. The optimizations +described below attempt to ameliorate this effect by allowing the +emulation process to communicate directly with the kernel KVM driver. +The KVM file descriptors created would be passed to the emulation process +via initialization messages, much like the guest memory table is. + +MMIO acceleration +^^^^^^^^^^^^^^^^^ + +Vhost user applications can receive guest virtio driver stores directly +from KVM. The issue with the eventfd mechanism used by vhost user is +that it does not pass any data with the event indication, so it cannot +handle guest loads or guest stores that carry store data. This concept +could, however, be expanded to cover more cases. + +The expanded idea would require a new type of KVM device: +*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master +descriptor that QEMU can use for configuration, and a slave descriptor +that the emulation process can use to receive MMIO notifications. QEMU +would create both descriptors using the KVM driver, and pass the slave +descriptor to the emulation process via an initialization message. + +data structures +''''''''''''''' + +- guest physical range + +The guest physical range structure describes the address range that a +device will respond to. 
It includes the base and length of the range, as +well as which bus the range resides on (e.g., on an x86 machine, it can +specify whether the range refers to memory or IO addresses). + +A device can have multiple physical address ranges it responds to (e.g., +a PCI device can have multiple BARs), so the structure will also include +an enumerated identifier to specify which of the device's ranges is +being referred to. + ++--------+----------------------------+ +| Name | Description | ++========+============================+ +| addr | range base address | ++--------+----------------------------+ +| len | range length | ++--------+----------------------------+ +| bus | addr type (memory or IO) | ++--------+----------------------------+ +| id | range ID (e.g., PCI BAR) | ++--------+----------------------------+ + +- MMIO request structure + +This structure describes an MMIO operation. It includes which guest +physical range the MMIO was within, the offset within that range, the +MMIO type (e.g., load or store), and its length and data. It also +includes a sequence number that can be used to reply to the MMIO, and +the CPU that issued the MMIO. + ++----------+------------------------+ +| Name | Description | ++==========+========================+ +| rid | range MMIO is within | ++----------+------------------------+ +| offset | offset within *rid* | ++----------+------------------------+ +| type | e.g., load or store | ++----------+------------------------+ +| len | MMIO length | ++----------+------------------------+ +| data | store data | ++----------+------------------------+ +| seq | sequence ID | ++----------+------------------------+ + +- MMIO request queues + +MMIO request queues are FIFO arrays of MMIO request structures. There +are two queues: the pending queue is for MMIOs that haven't been read by the +emulation program, and the sent queue is for MMIOs that haven't been +acknowledged. 
The main use of the second queue is to validate MMIO +replies from the emulation program. + +- scoreboard + +Each CPU in the VM is emulated in QEMU by a separate thread, so multiple +MMIOs may be waiting to be consumed by an emulation program and multiple +threads may be waiting for MMIO replies. The scoreboard would contain a +wait queue and sequence number for the per-CPU threads, allowing them to +be individually woken when the MMIO reply is received from the emulation +program. It also tracks the number of posted MMIO stores to the device +that haven't been replied to, in order to satisfy the PCI constraint +that a load to a device will not complete until all previous stores to +that device have been completed. + +- device shadow memory + +Some MMIO loads do not have device side-effects. These MMIOs can be +completed without sending an MMIO request to the emulation program if the +emulation program shares a shadow image of the device's memory image +with the KVM driver. + +The emulation program will ask the KVM driver to allocate memory for the +shadow image, and will then use ``mmap()`` to directly access it. The +emulation program can control KVM access to the shadow image by sending +KVM an access map telling it which areas of the image have no +side-effects (and can be completed immediately), and which require an +MMIO request to the emulation program. The access map can also inform +the KVM driver which size accesses are allowed to the image. + +master descriptor +''''''''''''''''' + +The master descriptor is used by QEMU to configure the new KVM device. +The descriptor would be returned by the KVM driver when QEMU issues a +*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type. + +KVM\_DEV\_TYPE\_USER device ops + + +The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a +``kvm_register_device_ops()`` call when the KVM system is initialized by +``kvm_init()``. 
These device ops are called by the KVM driver when QEMU +executes certain ``ioctl()`` operations on its KVM file descriptor. They +include: + +- create + +This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE* +``ioctl()`` on its per-VM file descriptor. It will allocate and +initialize a KVM user device specific data structure, and assign the +*kvm\_device* private field to it. + +- ioctl + +This routine is invoked when QEMU issues an ``ioctl()`` on the master +descriptor. The ``ioctl()`` commands supported are defined by the KVM +device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands: + +*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will +be passed to the device emulation program. Only one slave can be created +by each master descriptor. The file operations performed by this +descriptor are described below. + +The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical +address range that the slave descriptor will receive MMIO notifications +for. The range is specified by a guest physical range structure +argument. For buses that assign addresses to devices dynamically, this +command can be executed while the guest is running, such as the case +when a guest changes a device's PCI BAR registers. + +*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to +register *kvm\_io\_device\_ops* callbacks to be invoked when the guest +performs an MMIO operation within the range. When a range is changed, +``kvm_io_bus_unregister_dev()`` is used to remove the previous +instantiation. + +*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies +how long KVM will wait for the emulation process to respond to an MMIO +indication. + +- destroy + +This routine is called when the VM instance is destroyed. It will need +to destroy the slave descriptor and free any memory allocated by the +driver, as well as the *kvm\_device* structure itself. 
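The range bookkeeping implied by *KVM\_DEV\_USER\_PA\_RANGE* above can be modeled in user-space Python. This is a conceptual sketch only, not kernel code: the field names follow the guest physical range table earlier in this section, while ``GuestPhysRange``, ``RangeTable``, ``set_range`` and ``lookup`` are hypothetical names. Setting a range whose ID already exists replaces the previous one, mirroring the unregister-then-register step described for a BAR change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestPhysRange:
    addr: int  # range base address
    len: int   # range length
    bus: str   # address type: "memory" or "io"
    id: int    # range ID, e.g. a PCI BAR number


class RangeTable:
    """Toy model of the ranges one user device responds to."""

    def __init__(self):
        self._by_id = {}

    def set_range(self, rng):
        """Register rng, replacing any earlier range with the same ID."""
        self._by_id[rng.id] = rng

    def lookup(self, bus, addr):
        """Return (range ID, offset) for an access, or None if unclaimed."""
        for rng in self._by_id.values():
            if rng.bus == bus and rng.addr <= addr < rng.addr + rng.len:
                return rng.id, addr - rng.addr
        return None
```

The ``(range ID, offset)`` pair returned by ``lookup()`` corresponds to the *rid* and *offset* fields of the MMIO request structure.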
+ +slave descriptor +'''''''''''''''' + +The slave descriptor will have its own file operations vector, which +responds to system calls on the descriptor performed by the device +emulation program. + +- read + +A read returns any pending MMIO requests from the KVM driver as MMIO +request structures. Multiple structures can be returned if there are +multiple MMIO operations pending. The MMIO requests are moved from the +pending queue to the sent queue, and if there are threads waiting for +space in the pending queue to add new MMIO operations, they will be woken +here. + +- write + +A write also consists of a set of MMIO requests. They are compared to +the MMIO requests in the sent queue. Matches are removed from the sent +queue, and any threads waiting for the reply are woken. If a store is +removed, then the number of posted stores in the per-CPU scoreboard is +decremented. When the number is zero, and a non side-effect load was +waiting for posted stores to complete, the load is continued. + +- ioctl + +There are several ``ioctl()``\ s that can be performed on the slave +descriptor. + +A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to +allocate memory for the shadow image. This memory can later be +``mmap()``\ ed by the emulation process to share the emulation's view of +device memory with the KVM driver. + +A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the +shadow image. It will send the KVM driver a shadow control map, which +specifies which areas of the image can complete guest loads without +sending the load request to the emulation program. It will also specify +the size of load operations that are allowed. + +- poll + +An emulation program will use the ``poll()`` call with a *POLLIN* flag +to determine if there are MMIO requests waiting to be read. It will +return if the pending MMIO request queue is not empty. + +- mmap + +This call allows the emulation program to directly access the shadow +image allocated by the KVM driver. 
As device emulation updates device +memory, changes with no side-effects will be reflected in the shadow, +and the KVM driver can satisfy guest loads from the shadow image without +needing to wait for the emulation program. + +kvm\_io\_device ops +''''''''''''''''''' + +Each KVM per-CPU thread can handle MMIO operations on behalf of the guest +VM. KVM will use the MMIO's guest physical address to search for a +matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM +driver instead of exiting back to QEMU. If a match is found, the +corresponding callback will be invoked. + +- read + +This callback is invoked when the guest performs a load to the device. +Loads with side-effects must be handled synchronously, with the KVM +driver putting the QEMU thread to sleep waiting for the emulation +process reply before re-starting the guest. Loads that do not have +side-effects may be optimized by satisfying them from the shadow image, +if there are no outstanding stores to the device by this CPU. PCI memory +ordering demands that a load cannot complete before all older stores to +the same device have been completed. + +- write + +Stores can be handled asynchronously unless the pending MMIO request +queue is full. In this case, the QEMU thread must sleep waiting for +space in the queue. Stores will increment the number of posted stores in +the per-CPU scoreboard, in order to implement the PCI ordering +constraint above. + +interrupt acceleration +^^^^^^^^^^^^^^^^^^^^^^ + +This performance optimization would work much like a vhost user +application does, where the QEMU process sets up *eventfds* that cause +the device's corresponding interrupt to be triggered by the KVM driver. +These irq file descriptors are sent to the emulation process at +initialization, and are used when the emulation code raises a device +interrupt. 
+ +intx acceleration +''''''''''''''''' + +Traditional PCI pin interrupts are level based, so, in addition to an +irq file descriptor, a re-sampling file descriptor needs to be sent to +the emulation program. This second file descriptor allows multiple +devices sharing an irq to be notified when the interrupt has been +acknowledged by the guest, so they can re-trigger the interrupt if their +device has not de-asserted its interrupt. + +intx irq descriptor + + +The irq descriptors are created by the proxy object +using ``event_notifier_init()`` to create the irq and re-sampling +*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt. +The interrupt route can be found with +``pci_device_route_intx_to_irq()``. + +intx routing changes + + +Intx routing can be changed when the guest programs the APIC the device +pin is connected to. The proxy object in QEMU will use +``pci_device_set_intx_routing_notifier()`` to be informed of any guest +changes to the route. This handler will broadly follow the VFIO +interrupt logic to change the route: de-assigning the existing irq +descriptor from its route, then assigning it the new route. (see +``vfio_intx_update()``) + +MSI/X acceleration +'''''''''''''''''' + +MSI/X interrupts are sent as DMA transactions to the host. The interrupt +data contains a vector that is programmed by the guest. A device may have +multiple MSI interrupts associated with it, so multiple irq descriptors +may need to be sent to the emulation program. + +MSI/X irq descriptor + + +This case will also follow the VFIO example. For each MSI/X interrupt, +an *eventfd* is created, a virtual interrupt is allocated by +``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to +the eventfd with ``kvm_irqchip_add_irqfd_notifier()``. + +MSI/X config space changes + + +The guest may dynamically update several MSI-related tables in the +device's PCI config space. These include per-MSI interrupt enables and +vector data. 
Additionally, MSIX tables exist in device memory space, not +config space. Much like the BAR case above, the proxy object must look +at guest config space programming to keep the MSI interrupt state +consistent between QEMU and the emulation program. + +-------------- + +Disaggregated CPU emulation +--------------------------- + +After IO services have been disaggregated, a second phase would be to +separate a process to handle CPU instruction emulation from the main +QEMU control function. There are no object separation points for this +code, so the first task would be to create one. + +Host access controls +-------------------- + +Separating QEMU relies on the host OS's access restriction mechanisms to +enforce that the differing processes can only access the objects they +are entitled to. There are a couple of types of mechanisms usually provided +by general purpose OSs. + +Discretionary access control +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Discretionary access control allows each user to control who can access +their files. In Linux, this type of control is usually too coarse for +QEMU separation, since it only provides three separate access controls: +one for the same user ID, the second for user IDs with the same group +ID, and the third for all other user IDs. Each device instance would +need a separate user ID to provide access control, which is likely to be +unwieldy for dynamically created VMs. + +Mandatory access control +~~~~~~~~~~~~~~~~~~~~~~~~ + +Mandatory access control allows the OS to add an additional set of +controls on top of discretionary access for the OS to control. It also +adds other attributes to processes and files such as types, roles, and +categories, and can establish rules for how processes and files can +interact. + +Type enforcement +^^^^^^^^^^^^^^^^ + +Type enforcement assigns a *type* attribute to processes and files, and +allows rules to be written on what operations a process with a given +type can perform on a file with a given type. 
QEMU separation could take +advantage of type enforcement by running the emulation processes with +different types, both from the main QEMU process, and from the emulation +processes of different classes of devices. + +For example, guest disk images and disk emulation processes could have +types separate from the main QEMU process and non-disk emulation +processes, and the type rules could prevent processes other than disk +emulation ones from accessing guest disk images. Similarly, network +emulation processes can have a type separate from the main QEMU process +and non-network emulation processes, and only that type can access the +host tun/tap device used to provide guest networking. + +Category enforcement +^^^^^^^^^^^^^^^^^^^^ + +Category enforcement assigns a set of numbers within a given range to +the process or file. The process is granted access to the file if the +process's set is a superset of the file's set. This enforcement can be +used to separate multiple instances of devices in the same class. + +For example, if there are multiple disk devices provided to a guest, +each device emulation process could be provisioned with a separate +category. The different device emulation processes would not be able to +access each other's backing disk images. + +Alternatively, categories could be used in lieu of the type enforcement +scheme described above. In this scenario, different categories would be +used to prevent device emulation processes in different classes from +accessing resources assigned to other classes.
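The superset rule for category enforcement can be illustrated with a short Python sketch. It models only the set comparison described above, not any particular host security module; the function name ``may_access`` and the category numbers are hypothetical.

```python
def may_access(process_categories, file_categories):
    """Grant access only if the process holds every category on the file."""
    return set(process_categories) >= set(file_categories)


# Hypothetical provisioning: one category per disk emulation process,
# with each backing image labeled to match its own process.
disk_proc_a = {1}
disk_proc_b = {2}
image_a = {1}
image_b = {2}
```

With this provisioning, ``may_access(disk_proc_a, image_a)`` holds while ``may_access(disk_proc_a, image_b)`` does not, so each disk emulation process is confined to its own backing image.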