Message ID | 1407488118-11245-3-git-send-email-david.marchand@6wind.com (mailing list archive) |
---|---|
State | New, archived |
Hello David,

On 08.08.2014 10:55, David Marchand wrote:
> Add some notes on the parts needed to use ivshmem devices: more specifically,
> explain the purpose of an ivshmem server and the basic concept to use the
> ivshmem devices in guests.
> Move some parts of the documentation and re-organise it.
>
> Signed-off-by: David Marchand <david.marchand@6wind.com>

You did not include my Reviewed-by: tag, did you change this from v2?

Ciao,

Claudio

> ---
>  docs/specs/ivshmem_device_spec.txt | 124 +++++++++++++++++++++++++++---------
>  1 file changed, 93 insertions(+), 31 deletions(-)
>
> diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
> index 667a862..f5f2b95 100644
> --- a/docs/specs/ivshmem_device_spec.txt
> +++ b/docs/specs/ivshmem_device_spec.txt
> @@ -2,30 +2,103 @@
>  Device Specification for Inter-VM shared memory device
>  ------------------------------------------------------
>
> -The Inter-VM shared memory device is designed to share a region of memory to
> -userspace in multiple virtual guests. The memory region does not belong to any
> -guest, but is a POSIX memory object on the host. Optionally, the device may
> -support sending interrupts to other guests sharing the same memory region.
> +The Inter-VM shared memory device is designed to share a memory region (created
> +on the host via the POSIX shared memory API) between multiple QEMU processes
> +running different guests. In order for all guests to be able to pick up the
> +shared memory area, it is modeled by QEMU as a PCI device exposing said memory
> +to the guest as a PCI BAR.
> +The memory region does not belong to any guest, but is a POSIX memory object on
> +the host. The host can access this shared memory if needed.
> +
> +The device also provides an optional communication mechanism between guests
> +sharing the same memory object. More details about that in the section 'Guest to
> +guest communication' section.
>
>
>  The Inter-VM PCI device
>  -----------------------
>
> -*BARs*
> +From the VM point of view, the ivshmem PCI device supports three BARs.
> +
> +- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
> +  not used.
> +- BAR1 is used for MSI-X when it is enabled in the device.
> +- BAR2 is used to access the shared memory object.
> +
> +It is your choice how to use the device but you must choose between two
> +behaviors :
> +
> +- basically, if you only need the shared memory part, you will map BAR2.
> +  This way, you have access to the shared memory in guest and can use it as you
> +  see fit (memnic, for example, uses it in userland
> +  http://dpdk.org/browse/memnic).
> +
> +- BAR0 and BAR1 are used to implement an optional communication mechanism
> +  through interrupts in the guests. If you need an event mechanism between the
> +  guests accessing the shared memory, you will most likely want to write a
> +  kernel driver that will handle interrupts. See details in the section 'Guest
> +  to guest communication' section.
> +
> +The behavior is chosen when starting your QEMU processes:
> +- no communication mechanism needed, the first QEMU to start creates the shared
> +  memory on the host, subsequent QEMU processes will use it.
> +
> +- communication mechanism needed, an ivshmem server must be started before any
> +  QEMU processes, then each QEMU process connects to the server unix socket.
> +
> +For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
> +
> +
> +Guest to guest communication
> +----------------------------
> +
> +This section details the communication mechanism between the guests accessing
> +the ivhsmem shared memory.
>
> -The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
> -registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
> -used to map the shared memory object from the host. The size of BAR2 is
> -specified when the guest is started and must be a power of 2 in size.
> +*ivshmem server*
>
> -*Registers*
> +This server code is available in qemu.git/contrib/ivshmem-server.
>
> -The device currently supports 4 registers of 32-bits each. Registers
> -are used for synchronization between guests sharing the same memory object when
> -interrupts are supported (this requires using the shared memory server).
> +The server must be started on the host before any guest.
> +It creates a shared memory object then waits for clients to connect on an unix
> +socket.
>
> -The server assigns each VM an ID number and sends this ID number to the QEMU
> -process when the guest starts.
> +For each client (QEMU processes) that connects to the server:
> +- the server assigns an ID for this client and sends this ID to him as the first
> +  message,
> +- the server sends a fd to the shared memory object to this client,
> +- the server creates a new set of host eventfds associated to the new client and
> +  sends this set to all already connected clients,
> +- finally, the server sends all the eventfds sets for all clients to the new
> +  client.
> +
> +The server signals all clients when one of them disconnects.
> +
> +The client IDs are limited to 16 bits because of the current implementation (see
> +Doorbell register in 'PCI device registers' subsection). Hence on 65536 clients
> +are supported.
> +
> +All the file descriptors (fd to the shared memory, eventfds for each client)
> +are passed to clients using SCM_RIGHTS over the server unix socket.
> +
> +Apart from the current ivshmem implementation in QEMU, an ivshmem client has
> +been provided in qemu.git/contrib/ivshmem-client for debug.
> +
> +*QEMU as an ivshmem client*
> +
> +At initialisation, when creating the ivshmem device, QEMU gets its ID from the
> +server then make it available through BAR0 IVPosition register for the VM to use
> +(see 'PCI device registers' subsection).
> +QEMU then uses the fd to the shared memory to map it to BAR2.
> +eventfds for all other clients received from the server are stored to implement
> +BAR0 Doorbell register (see 'PCI device registers' subsection).
> +Finally, eventfds assigned to this QEMU process are used to send interrupts in
> +this VM.
> +
> +*PCI device registers*
> +
> +From the VM point of view, the ivshmem PCI device supports 4 registers of
> +32-bits each.
>
>  enum ivshmem_registers {
>      IntrMask = 0,
> @@ -49,8 +122,8 @@ bit to 0 and unmasked by setting the first bit to 1.
>  IVPosition Register: The IVPosition register is read-only and reports the
>  guest's ID number. The guest IDs are non-negative integers. When using the
>  server, since the server is a separate process, the VM ID will only be set when
> -the device is ready (shared memory is received from the server and accessible via
> -the device). If the device is not ready, the IVPosition will return -1.
> +the device is ready (shared memory is received from the server and accessible
> +via the device). If the device is not ready, the IVPosition will return -1.
>  Applications should ensure that they have a valid VM ID before accessing the
>  shared memory.
>
> @@ -59,8 +132,8 @@ Doorbell register. The doorbell register is 32-bits, logically divided into
>  two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
>  16-bits are the interrupt vector to trigger. The semantics of the value
>  written to the doorbell depends on whether the device is using MSI or a regular
> -pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
> -status register.
> +pin-based interrupt. In short, MSI uses vectors while regular interrupts set
> +the status register.
>
>  Regular Interrupts
>
> @@ -71,7 +144,7 @@ interrupt in the destination guest.
>
>  Message Signalled Interrupts
>
> -A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
> +An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
>  written to the Doorbell register must be between 0 and the maximum number of
>  vectors the guest supports. The lower 16 bits written to the doorbell is the
>  MSI vector that will be raised in the destination guest. The number of MSI
> @@ -83,14 +156,3 @@ interrupt itself should be communicated via the shared memory region. Devices
>  supporting multiple MSI vectors can use different vectors to indicate different
>  events have occurred. The semantics of interrupt vectors are left to the
>  user's discretion.
> -
> -
> -Usage in the Guest
> -------------------
> -
> -The shared memory device is intended to be used with the provided UIO driver.
> -Very little configuration is needed. The guest should map BAR0 to access the
> -registers (an array of 32-bit ints allows simple writing) and map BAR2 to
> -access the shared memory region itself. The size of the shared memory region
> -is specified when the guest (or shared memory server) is started. A guest may
> -map the whole shared memory region or only part of it.
>
Hello Claudio,

On 08/08/2014 11:04 AM, Claudio Fontana wrote:
> On 08.08.2014 10:55, David Marchand wrote:
>> Add some notes on the parts needed to use ivshmem devices: more specifically,
>> explain the purpose of an ivshmem server and the basic concept to use the
>> ivshmem devices in guests.
>> Move some parts of the documentation and re-organise it.
>>
>> Signed-off-by: David Marchand <david.marchand@6wind.com>
>
> You did not include my Reviewed-by: tag, did you change this from v2?
>

No, I did not change anything in the documentation patch, I only touched
the client/server patch.

I forgot to add your Reviewed-by tag ... (added to my tree for now, will
send a v4 when I have some feedback on the client/server code).

Thanks Claudio.
On Fri, Aug 08, 2014 at 10:55:18AM +0200, David Marchand wrote:
> +For each client (QEMU processes) that connects to the server:
> +- the server assigns an ID for this client and sends this ID to him as the first
> +  message,
> +- the server sends a fd to the shared memory object to this client,
> +- the server creates a new set of host eventfds associated to the new client and
> +  sends this set to all already connected clients,
> +- finally, the server sends all the eventfds sets for all clients to the new
> +  client.

The protocol is not extensible and no version number is exchanged. For
the most part this should be okay because clients must run on the same
machine as the server. It is assumed clients and server are compatible
with each other.

I wonder if we'll get into trouble later if the protocol needs to be
extended or some operation needs to happen, like upgrading QEMU or the
ivshmem-server. At the very least someone building from source but
using system QEMU or ivshmem-server could get confusing failures if the
protocol doesn't match.

How about sending a version message as the first thing during a
connection?

Stefan
Hello Stefan,

On 08/08/2014 05:02 PM, Stefan Hajnoczi wrote:
> On Fri, Aug 08, 2014 at 10:55:18AM +0200, David Marchand wrote:
>> +For each client (QEMU processes) that connects to the server:
>> +- the server assigns an ID for this client and sends this ID to him as the first
>> +  message,
>> +- the server sends a fd to the shared memory object to this client,
>> +- the server creates a new set of host eventfds associated to the new client and
>> +  sends this set to all already connected clients,
>> +- finally, the server sends all the eventfds sets for all clients to the new
>> +  client.
>
> The protocol is not extensible and no version number is exchanged. For
> the most part this should be okay because clients must run on the same
> machine as the server. It is assumed clients and server are compatible
> with each other.
>
> I wonder if we'll get into trouble later if the protocol needs to be
> extended or some operation needs to happen, like upgrading QEMU or the
> ivshmem-server. At the very least someone building from source but
> using system QEMU or ivshmem-server could get confusing failures if the
> protocol doesn't match.
>
> How about sending a version message as the first thing during a
> connection?

I am not too sure about this.

This would break current base version.

Using a version message supposes we want to keep ivshmem-server and QEMU
separated (for example, in two distribution packages) while we can avoid
this, so why would we do so ?

If we want the ivshmem-server to come with QEMU, then both are supposed
to be aligned on your system.
If you want to test local modifications, then it means you know what you
are doing and you will call the right ivshmem-server binary with the
right QEMU binary.

Thanks.
On 26/08/2014 08:47, David Marchand wrote:
>
> Using a version message supposes we want to keep ivshmem-server and QEMU
> separated (for example, in two distribution packages) while we can avoid
> this, so why would we do so ?
>
> If we want the ivshmem-server to come with QEMU, then both are supposed
> to be aligned on your system.

What about upgrading QEMU and ivshmem-server while you have existing
guests? You cannot restart ivshmem-server, and the new QEMU would have
to talk to the old ivshmem-server.

Paolo

> If you want to test local modifications, then it means you know what you
> are doing and you will call the right ivshmem-server binary with the
> right QEMU binary.
On Tue, Aug 26, 2014 at 01:04:30PM +0200, Paolo Bonzini wrote:
> On 26/08/2014 08:47, David Marchand wrote:
> >
> > Using a version message supposes we want to keep ivshmem-server and QEMU
> > separated (for example, in two distribution packages) while we can avoid
> > this, so why would we do so ?
> >
> > If we want the ivshmem-server to come with QEMU, then both are supposed
> > to be aligned on your system.
>
> What about upgrading QEMU and ivshmem-server while you have existing
> guests? You cannot restart ivshmem-server, and the new QEMU would have
> to talk to the old ivshmem-server.

Version negotiation also helps avoid confusion if someone combines
ivshmem-server and QEMU from different origins (e.g. built from source
and distro packaged).

It's a safeguard to prevent hard-to-diagnose failures when the system is
misconfigured.

Stefan
On 08/28/2014 11:49 AM, Stefan Hajnoczi wrote:
> On Tue, Aug 26, 2014 at 01:04:30PM +0200, Paolo Bonzini wrote:
>> On 26/08/2014 08:47, David Marchand wrote:
>>>
>>> Using a version message supposes we want to keep ivshmem-server and QEMU
>>> separated (for example, in two distribution packages) while we can avoid
>>> this, so why would we do so ?
>>>
>>> If we want the ivshmem-server to come with QEMU, then both are supposed
>>> to be aligned on your system.
>>
>> What about upgrading QEMU and ivshmem-server while you have existing
>> guests? You cannot restart ivshmem-server, and the new QEMU would have
>> to talk to the old ivshmem-server.
>
> Version negotiation also helps avoid confusion if someone combines
> ivshmem-server and QEMU from different origins (e.g. built from source
> and distro packaged).
>
> It's a safeguard to prevent hard-to-diagnose failures when the system is
> misconfigured.
>

Hum, so you want the code to be defensive against mis-use, why not.

I wanted to keep modifications on ivshmem as little as possible in a
first phase (all the more so as there are potential ivshmem users out
there that I think will be impacted by a protocol change).

Sending the version as the first "vm_id" with an associated fd to -1
before sending the real client id should work with existing QEMU client
code (hw/misc/ivshmem.c).

Do you have a better idea ?
Is there a best practice in QEMU for "version negotiation" that could
work with ivshmem protocol ?

I have a v4 ready with this (and all the pending comments), I will send
it later unless a better idea is exposed.

Thanks.
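
For readers following the proposal in the message above (a version sent as the first "vm_id" with fd -1, followed by the real client id), here is a minimal sketch of how a client could consume those first two messages. It is only an illustration under stated assumptions: the names, the version value 0, and the exact checks are all hypothetical, and it says nothing about how the existing code in hw/misc/ivshmem.c actually behaves.

```c
#include <stdint.h>

/* Hypothetical protocol version; the thread does not define one. */
#define SKETCH_PROTOCOL_VERSION 0

/* Hypothetical client states for the first two messages of a connection. */
enum sketch_state {
    SKETCH_EXPECT_VERSION,  /* nothing received yet */
    SKETCH_EXPECT_ID,       /* version accepted, waiting for our client id */
    SKETCH_READY,           /* client id known */
    SKETCH_FAILED           /* mismatch: the client should disconnect */
};

struct sketch_client {
    enum sketch_state state;
    int64_t id;             /* valid only in SKETCH_READY */
};

/* Feed one (value, fd) message, as received from the server socket.
 * fd is -1 when no file descriptor came with the message. */
static enum sketch_state sketch_client_feed(struct sketch_client *c,
                                            int64_t value, int fd)
{
    switch (c->state) {
    case SKETCH_EXPECT_VERSION:
        /* Proposal: the version travels like an id, with fd == -1. */
        if (fd == -1 && value == SKETCH_PROTOCOL_VERSION) {
            c->state = SKETCH_EXPECT_ID;
        } else {
            c->state = SKETCH_FAILED;
        }
        break;
    case SKETCH_EXPECT_ID:
        /* Client ids are limited to 16 bits by the Doorbell layout. */
        if (value >= 0 && value <= UINT16_MAX) {
            c->id = value;
            c->state = SKETCH_READY;
        } else {
            c->state = SKETCH_FAILED;
        }
        break;
    default:
        /* Later messages (shared memory fd, eventfds) are out of scope. */
        break;
    }
    return c->state;
}
```

A mismatching first message moves the client straight to the failed state, which is the early, easy-to-diagnose failure the reviewers ask for.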
On 09/01/2014 03:52 AM, David Marchand wrote:

>>> What about upgrading QEMU and ivshmem-server while you have existing
>>> guests? You cannot restart ivshmem-server, and the new QEMU would have
>>> to talk to the old ivshmem-server.
>>
>> Version negotiation also helps avoid confusion if someone combines
>> ivshmem-server and QEMU from different origins (e.g. built from source
>> and distro packaged).

Don't underestimate the likelihood of this happening. Any long-running
process (which an ivshmem-server will be) continues running at the old
version, even when a package upgrade installs a new qemu binary; the new
binary should still be able to manage connections to the already-running
server.

Even neater would be a solution where an existing ivshmem-server could
re-exec an updated ivshmem-server binary that resulted from a distro
upgrade, hand over all state required for the new server to take over
from the point managed by the old server, so that you aren't stuck
running the old binary forever. But that's a lot trickier to write, so
it is not necessary for a first implementation; and if you do that, then
you have the reverse situation to worry about (the new server must still
accept communication from existing old qemu binaries).

Note that the goal here is to support upgrades; it is probably okay if
downgrading from a new binary back to an old doesn't work correctly
(because the new software was using a feature not present in the old).

>>
>> It's a safeguard to prevent hard-to-diagnose failures when the system is
>> misconfigured.
>>
>
> Hum, so you want the code to be defensive against mis-use, why not.
>
> I wanted to keep modifications on ivshmem as little as possible in a
> first phase (all the more so as there are potential ivshmem users out
> there that I think will be impacted by a protocol change).

Existing ivshmem users MUST be aware that they are using something that
is not yet polished, and be prepared to make the upgrade to the polished
version. It's best to minimize the hassle by making them upgrade
exactly once to a fully-robust version, rather than to have them upgrade
to a slightly-more robust version only to find out we didn't plan ahead
well enough to make further extensions in a back-compatible manner.

>
> Sending the version as the first "vm_id" with an associated fd to -1
> before sending the real client id should work with existing QEMU client
> code (hw/misc/ivshmem.c).
>
> Do you have a better idea ?
> Is there a best practice in QEMU for "version negotiation" that could
> work with ivshmem protocol ?

QMP starts off with a mandatory "qmp_capabilities" handshake, although
we haven't yet had to define any capabilities where cross-versioned
communication differs as a result. Migration is somewhat of an example,
except that it is one-directional (we don't have a feedback path), so it
is somewhat best effort. The qcow2 v3 file format is an example of
declaring features, rather than version numbers, and making decisions
about whether a feature is compatible (older clients can safely ignore
the bit, without corrupting the image but possibly having worse
performance) vs. incompatible (older clients must reject the image,
because not handling the feature correctly would corrupt the image).

The best handshakes are bi-directional - both sides advertise their
version (or better, their features), and a well-defined algorithm for
settling on the common subset of advertised features then ensures that
the two sides know how to talk to each other, or give a reason for
either side to disconnect early because of a missing feature.

>
> I have a v4 ready with this (and all the pending comments), I will send
> it later unless a better idea is exposed.
>
>
> Thanks.
>
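
The bi-directional handshake described in the message above can be sketched roughly as follows. This is not an existing QEMU or ivshmem API: the structure, the function name and the idea of encoding features as bits in a 64-bit mask are assumptions made for illustration, loosely modelled on the compatible/incompatible split the message attributes to qcow2 v3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature advertisement sent by each side of a connection. */
struct sketch_features {
    uint64_t supported; /* every feature bit this side understands */
    uint64_t required;  /* subset of 'supported' the peer must also
                         * understand ("incompatible" features) */
};

/* Settle on the common subset of features.
 * Returns false when either side requires a feature the other one lacks,
 * which is the cue to disconnect early with a clear reason.
 * On success, *common holds the features both sides may use. */
static bool sketch_negotiate(const struct sketch_features *local,
                             const struct sketch_features *remote,
                             uint64_t *common)
{
    if ((local->required & ~remote->supported) != 0 ||
        (remote->required & ~local->supported) != 0) {
        return false;
    }
    *common = local->supported & remote->supported;
    return true;
}
```

Because both sides run the same symmetric check on the same two advertisements, they reach the same answer without a further round trip. A feature that only one side supports and neither requires is simply left out of the common set.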
diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
index 667a862..f5f2b95 100644
--- a/docs/specs/ivshmem_device_spec.txt
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -2,30 +2,103 @@
 Device Specification for Inter-VM shared memory device
 ------------------------------------------------------
 
-The Inter-VM shared memory device is designed to share a region of memory to
-userspace in multiple virtual guests. The memory region does not belong to any
-guest, but is a POSIX memory object on the host. Optionally, the device may
-support sending interrupts to other guests sharing the same memory region.
+The Inter-VM shared memory device is designed to share a memory region (created
+on the host via the POSIX shared memory API) between multiple QEMU processes
+running different guests. In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing said memory
+to the guest as a PCI BAR.
+The memory region does not belong to any guest, but is a POSIX memory object on
+the host. The host can access this shared memory if needed.
+
+The device also provides an optional communication mechanism between guests
+sharing the same memory object. More details about that in the section 'Guest to
+guest communication' section.
 
 
 The Inter-VM PCI device
 -----------------------
 
-*BARs*
+From the VM point of view, the ivshmem PCI device supports three BARs.
+
+- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
+  not used.
+- BAR1 is used for MSI-X when it is enabled in the device.
+- BAR2 is used to access the shared memory object.
+
+It is your choice how to use the device but you must choose between two
+behaviors :
+
+- basically, if you only need the shared memory part, you will map BAR2.
+  This way, you have access to the shared memory in guest and can use it as you
+  see fit (memnic, for example, uses it in userland
+  http://dpdk.org/browse/memnic).
+
+- BAR0 and BAR1 are used to implement an optional communication mechanism
+  through interrupts in the guests. If you need an event mechanism between the
+  guests accessing the shared memory, you will most likely want to write a
+  kernel driver that will handle interrupts. See details in the section 'Guest
+  to guest communication' section.
+
+The behavior is chosen when starting your QEMU processes:
+- no communication mechanism needed, the first QEMU to start creates the shared
+  memory on the host, subsequent QEMU processes will use it.
+
+- communication mechanism needed, an ivshmem server must be started before any
+  QEMU processes, then each QEMU process connects to the server unix socket.
+
+For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
+
+
+Guest to guest communication
+----------------------------
+
+This section details the communication mechanism between the guests accessing
+the ivhsmem shared memory.
 
-The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
-registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
-used to map the shared memory object from the host. The size of BAR2 is
-specified when the guest is started and must be a power of 2 in size.
+*ivshmem server*
 
-*Registers*
+This server code is available in qemu.git/contrib/ivshmem-server.
 
-The device currently supports 4 registers of 32-bits each. Registers
-are used for synchronization between guests sharing the same memory object when
-interrupts are supported (this requires using the shared memory server).
+The server must be started on the host before any guest.
+It creates a shared memory object then waits for clients to connect on an unix
+socket.
 
-The server assigns each VM an ID number and sends this ID number to the QEMU
-process when the guest starts.
+For each client (QEMU processes) that connects to the server:
+- the server assigns an ID for this client and sends this ID to him as the first
+  message,
+- the server sends a fd to the shared memory object to this client,
+- the server creates a new set of host eventfds associated to the new client and
+  sends this set to all already connected clients,
+- finally, the server sends all the eventfds sets for all clients to the new
+  client.
+
+The server signals all clients when one of them disconnects.
+
+The client IDs are limited to 16 bits because of the current implementation (see
+Doorbell register in 'PCI device registers' subsection). Hence on 65536 clients
+are supported.
+
+All the file descriptors (fd to the shared memory, eventfds for each client)
+are passed to clients using SCM_RIGHTS over the server unix socket.
+
+Apart from the current ivshmem implementation in QEMU, an ivshmem client has
+been provided in qemu.git/contrib/ivshmem-client for debug.
+
+*QEMU as an ivshmem client*
+
+At initialisation, when creating the ivshmem device, QEMU gets its ID from the
+server then make it available through BAR0 IVPosition register for the VM to use
+(see 'PCI device registers' subsection).
+QEMU then uses the fd to the shared memory to map it to BAR2.
+eventfds for all other clients received from the server are stored to implement
+BAR0 Doorbell register (see 'PCI device registers' subsection).
+Finally, eventfds assigned to this QEMU process are used to send interrupts in
+this VM.
+
+*PCI device registers*
+
+From the VM point of view, the ivshmem PCI device supports 4 registers of
+32-bits each.
 
 enum ivshmem_registers {
     IntrMask = 0,
@@ -49,8 +122,8 @@ bit to 0 and unmasked by setting the first bit to 1.
 IVPosition Register: The IVPosition register is read-only and reports the
 guest's ID number. The guest IDs are non-negative integers. When using the
 server, since the server is a separate process, the VM ID will only be set when
-the device is ready (shared memory is received from the server and accessible via
-the device). If the device is not ready, the IVPosition will return -1.
+the device is ready (shared memory is received from the server and accessible
+via the device). If the device is not ready, the IVPosition will return -1.
 Applications should ensure that they have a valid VM ID before accessing the
 shared memory.
 
@@ -59,8 +132,8 @@ Doorbell register. The doorbell register is 32-bits, logically divided into
 two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
 16-bits are the interrupt vector to trigger. The semantics of the value
 written to the doorbell depends on whether the device is using MSI or a regular
-pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
-status register.
+pin-based interrupt. In short, MSI uses vectors while regular interrupts set
+the status register.
 
 Regular Interrupts
 
@@ -71,7 +144,7 @@ interrupt in the destination guest.
 
 Message Signalled Interrupts
 
-A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
+An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
 written to the Doorbell register must be between 0 and the maximum number of
 vectors the guest supports. The lower 16 bits written to the doorbell is the
 MSI vector that will be raised in the destination guest. The number of MSI
@@ -83,14 +156,3 @@ interrupt itself should be communicated via the shared memory region. Devices
 supporting multiple MSI vectors can use different vectors to indicate different
 events have occurred. The semantics of interrupt vectors are left to the
 user's discretion.
-
-
-Usage in the Guest
-------------------
-
-The shared memory device is intended to be used with the provided UIO driver.
-Very little configuration is needed. The guest should map BAR0 to access the
-registers (an array of 32-bit ints allows simple writing) and map BAR2 to
-access the shared memory region itself. The size of the shared memory region
-is specified when the guest (or shared memory server) is started. A guest may
-map the whole shared memory region or only part of it.
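
The Doorbell layout stated in the patch (high 16 bits are the guest ID to interrupt, low 16 bits are the interrupt vector) can be illustrated with a few helpers. The function names are made up for this note and do not come from the patch or from QEMU. The helpers only compute the value, and writing it to the register in BAR0 is left to the guest driver.

```c
#include <stdint.h>

/* Build the 32-bit value to write to the Doorbell register:
 * high 16 bits = destination guest ID, low 16 bits = interrupt vector. */
static inline uint32_t sketch_doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/* Extract the destination guest ID from a Doorbell value. */
static inline uint16_t sketch_doorbell_peer(uint32_t value)
{
    return (uint16_t)(value >> 16);
}

/* Extract the interrupt vector from a Doorbell value. */
static inline uint16_t sketch_doorbell_vector(uint32_t value)
{
    return (uint16_t)(value & 0xffff);
}
```

For example, interrupting guest 3 on vector 1 means writing 0x00030001. The 16-bit ID field is also why the patch limits the server to 65536 clients.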
Add some notes on the parts needed to use ivshmem devices: more specifically,
explain the purpose of an ivshmem server and the basic concept to use the
ivshmem devices in guests.
Move some parts of the documentation and re-organise it.

Signed-off-by: David Marchand <david.marchand@6wind.com>
---
 docs/specs/ivshmem_device_spec.txt | 124 +++++++++++++++++++++++++++---------
 1 file changed, 93 insertions(+), 31 deletions(-)
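
The patch states that the server passes all file descriptors to clients using SCM_RIGHTS over its unix socket. As background, here is a minimal sketch of that standard mechanism with sendmsg() and recvmsg(). The message layout (one 64-bit value with an optional fd attached) and the function names are assumptions made for illustration and are not taken from the ivshmem-server code.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union sketch_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one 64-bit value; if fd >= 0, attach it as SCM_RIGHTS data.
 * Returns 0 on success, -1 on error. */
static int sketch_send_msg(int sock, int64_t value, int fd)
{
    union sketch_ctrl ctrl;
    struct iovec iov = { .iov_base = &value, .iov_len = sizeof(value) };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    if (fd >= 0) {
        struct cmsghdr *cmsg;

        memset(&ctrl, 0, sizeof(ctrl));
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    }
    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Receive one 64-bit value; *fd is set to the passed descriptor,
 * or to -1 when none was attached. Returns 0 on success, -1 on error. */
static int sketch_recv_msg(int sock, int64_t *value, int *fd)
{
    union sketch_ctrl ctrl;
    struct iovec iov = { .iov_base = value, .iov_len = sizeof(*value) };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    *fd = -1;
    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(*value)) {
        return -1;
    }
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS &&
            cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
        }
    }
    return 0;
}
```

The received descriptor is a new fd number in the receiving process that refers to the same open file, which is what lets every QEMU process map the same shared memory object and signal the same eventfds.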