
[08/38] ivshmem: Rewrite specification document

Message ID 1456771254-17511-9-git-send-email-armbru@redhat.com (mailing list archive)
State New, archived

Commit Message

Markus Armbruster Feb. 29, 2016, 6:40 p.m. UTC
This started as an attempt to update ivshmem_device_spec.txt for
clarity, accuracy and completeness while working on its code, and
quickly became a full rewrite.  Since the diff would be useless
anyway, I'm using the opportunity to rename the file to
ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text
nor the code.  If the new text contradicts the old text but not the
code, it's probably a bug in the old text.  If the new text
contradicts both, it's probably a bug in the new text.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 docs/specs/ivshmem-spec.txt        | 244 +++++++++++++++++++++++++++++++++++++
 docs/specs/ivshmem_device_spec.txt | 161 ------------------------
 2 files changed, 244 insertions(+), 161 deletions(-)
 create mode 100644 docs/specs/ivshmem-spec.txt
 delete mode 100644 docs/specs/ivshmem_device_spec.txt
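
For readers skimming the patch, the register layout in the new spec's
"PCI device registers" section can be sketched as follows.  This is a
minimal illustration, not code from QEMU: the constant and function
names are hypothetical, and it assumes the guest reads IVPosition as
a raw 32-bit value that then has to be interpreted as signed.

```python
# Illustrative sketch only; all names here are hypothetical, not QEMU identifiers.

# Register offsets within BAR0, taken from the spec's register table.
INTR_MASK = 0
INTR_STATUS = 4
IV_POSITION = 8
DOORBELL = 12


def doorbell_value(peer_id: int, vector: int) -> int:
    """Build the 32-bit value to write to the Doorbell register.

    Bits 16..31 carry the peer ID, bits 0..15 the interrupt vector.
    """
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector


def iv_position(raw: int) -> int:
    """Interpret a raw 32-bit IVPosition read as a signed number.

    Assumption: the read yields an unsigned 32-bit value, so -1 shows
    up as 0xFFFFFFFF.
    """
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw & 0x80000000 else raw


def intx_asserted(status: int, mask: int, has_msix: bool) -> bool:
    """INTx is asserted when Status AND Mask is non-zero and there is no MSI-X."""
    return not has_msix and (status & mask) != 0
```

For example, doorbell_value(3, 1) asks for vector 1 of peer 3, and a
negative iv_position() result means BAR2 should not be touched yet.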

Comments

Marc-André Lureau March 1, 2016, 11:25 a.m. UTC | #1
On Mon, Feb 29, 2016 at 7:40 PM, Markus Armbruster <armbru@redhat.com> wrote:
> This started as an attempt to update ivshmem_device_spec.txt for
> clarity, accuracy and completeness while working on its code, and
> quickly became a full rewrite.  Since the diff would be useless
> anyway, I'm using the opportunity to rename the file to
> ivshmem-spec.txt.
>
> I tried hard to ensure the new text contradicts neither the old text
> nor the code.  If the new text contradicts the old text but not the
> code, it's probably a bug in the old text.  If the new text
> contradicts both, its probably a bug in the new text.
>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


> ---
>  docs/specs/ivshmem-spec.txt        | 244 +++++++++++++++++++++++++++++++++++++
>  docs/specs/ivshmem_device_spec.txt | 161 ------------------------
>  2 files changed, 244 insertions(+), 161 deletions(-)
>  create mode 100644 docs/specs/ivshmem-spec.txt
>  delete mode 100644 docs/specs/ivshmem_device_spec.txt
>
> diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
> new file mode 100644
> index 0000000..0835ba1
> --- /dev/null
> +++ b/docs/specs/ivshmem-spec.txt
> @@ -0,0 +1,244 @@
> += Device Specification for Inter-VM shared memory device =
> +
> +The Inter-VM shared memory device (ivshmem) is designed to share a
> +memory region between multiple QEMU processes running different guests
> +and the host.  In order for all guests to be able to pick up the
> +shared memory area, it is modeled by QEMU as a PCI device exposing
> +said memory to the guest as a PCI BAR.
> +
> +The device can use a shared memory object on the host directly, or it
> +can obtain one from an ivshmem server.
> +
> +In the latter case, the device can additionally interrupt its peers, and
> +get interrupted by its peers.
> +
> +
> +== Configuring the ivshmem PCI device ==
> +
> +There are two basic configurations:
> +
> +- Just shared memory: -device ivshmem,shm=NAME,...
> +
> +  This uses shared memory object NAME.
> +
> +- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
> +
> +  An ivshmem server must already be running on the host.  The device
> +  connects to the server's UNIX domain socket via character device
> +  CHR.
> +
> +  Each peer gets assigned a unique ID by the server.  IDs must be
> +  between 0 and 65535.
> +
> +  Interrupts are message-signaled by default (MSI-X).  With msi=off
> +  the device has no MSI-X capability, and uses legacy INTx instead.
> +  vectors=N configures the number of vectors to use.
> +
> +For more details on ivshmem device properties, see The QEMU Emulator
> +User Documentation (qemu-doc.*).
> +
> +
> +== The ivshmem PCI device's guest interface ==
> +
> +The device has vendor ID 1af4, device ID 1110, revision 0.
> +
> +=== PCI BARs ===
> +
> +The ivshmem PCI device has two or three BARs:
> +
> +- BAR0 holds device registers (256 Byte MMIO)
> +- BAR1 holds MSI-X table and PBA (only when using MSI-X)
> +- BAR2 maps the shared memory object
> +
> +There are two ways to use this device:
> +
> +- If you only need the shared memory part, BAR2 suffices.  This way,
> +  you have access to the shared memory in the guest and can use it as
> +  you see fit.  Memnic, for example, uses ivshmem this way from guest
> +  user space (see http://dpdk.org/browse/memnic).
> +
> +- If you additionally need the capability for peers to interrupt each
> +  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
> +  likely want to write a kernel driver to handle interrupts.  Requires
> +  the device to be configured for interrupts, obviously.
> +
> +If the device is configured for interrupts, BAR2 is initially invalid.
> +It becomes safely accessible only after the ivshmem server provided
> +the shared memory.  Guest software should wait for the IVPosition
> +register (described below) to become non-negative before accessing
> +BAR2.
> +
> +The device is not capable to tell guest software whether it is
> +configured for interrupts.
> +
> +=== PCI device registers ===
> +
> +BAR 0 contains the following registers:
> +
> +    Offset  Size  Access      On reset  Function
> +        0     4   read/write        0   Interrupt Mask
> +                                        bit 0: peer interrupt
> +                                        bit 1..31: reserved
> +        4     4   read/write        0   Interrupt Status
> +                                        bit 0: peer interrupt
> +                                        bit 1..31: reserved
> +        8     4   read-only   0 or -1   IVPosition
> +       12     4   write-only      N/A   Doorbell
> +                                        bit 0..15: vector
> +                                        bit 16..31: peer ID
> +       16   240   none            N/A   reserved
> +
> +Software should only access the registers as specified in column
> +"Access".  Reserved bits should be ignored on read, and preserved on
> +write.
> +
> +Interrupt Status and Mask Register together control the legacy INTx
> +interrupt when the device has no MSI-X capability: INTx is asserted
> +when the bit-wise AND of Status and Mask is non-zero and the device
> +has no MSI-X capability.  Interrupt Status Register bit 0 becomes 1
> +when an interrupt request from a peer is received.  Reading the
> +register clears it.
> +
> +IVPosition Register: if the device is not configured for interrupts,
> +this is zero.  Else, it's -1 for a short while after reset, then
> +changes to the device's ID (between 0 and 65535).
> +
> +There is no good way for software to find out whether the device is
> +configured for interrupts.  A positive IVPosition means interrupts,
> +but zero could be either.  The initial -1 cannot be reliably observed.
> +
> +Doorbell Register: writing this register requests to interrupt a peer.
> +The written value's high 16 bits are the ID of the peer to interrupt,
> +and its low 16 bits select an interrupt vector.
> +
> +If the device is not configured for interrupts, the write is ignored.
> +
> +If the interrupt hasn't completed setup, the write is ignored.  The
> +device is not capable to tell guest software whether setup is
> +complete.  Interrupts can regress to this state on migration.
> +
> +If the peer with the requested ID isn't connected, or it has fewer
> +interrupt vectors connected, the write is ignored.  The device is not
> +capable to tell guest software what peers are connected, or how many
> +interrupt vectors are connected.
> +
> +If the peer doesn't use MSI-X, its Interrupt Status register is set to
> +1.  This asserts INTx unless masked by the Interrupt Mask register.
> +The device is not capable to communicate the interrupt vector to guest
> +software then.
> +
> +If the peer uses MSI-X, the interrupt for this vector becomes pending.
> +There is no way for software to clear the pending bit, and a polling
> +mode of operation is therefore impossible with MSI-X.
> +
> +With multiple MSI-X vectors, different vectors can be used to indicate
> +different events have occurred.  The semantics of interrupt vectors
> +are left to the application.
> +
> +
> +== Interrupt infrastructure ==
> +
> +When configured for interrupts, the peers share eventfd objects in
> +addition to shared memory.  The shared resources are managed by an
> +ivshmem server.
> +
> +=== The ivshmem server ===
> +
> +The server listens on a UNIX domain socket.
> +
> +For each new client that connects to the server, the server
> +- picks an ID,
> +- creates eventfd file descriptors for the interrupt vectors,
> +- sends the ID and the file descriptor for the shared memory to the
> +  new client,
> +- sends connect notifications for the new client to the other clients
> +  (these contain file descriptors for sending interrupts),
> +- sends connect notifications for the other clients to the new client,
> +  and
> +- sends interrupt setup messages to the new client (these contain file
> +  descriptors for receiving interrupts).
> +
> +When a client disconnects from the server, the server sends disconnect
> +notifications to the other clients.
> +
> +The next section describes the protocol in detail.
> +
> +If the server terminates without sending disconnect notifications for
> +its connected clients, the clients can elect to continue.  They can
> +communicate with each other normally, but won't receive disconnect
> +notification on disconnect, and no new clients can connect.  There is
> +no way for the clients to connect to a restarted the server.  The
> +device is not capable to tell guest software whether the server is
> +still up.
> +
> +Example server code is in contrib/ivshmem-server/.  Not to be used in
> +production.  It assumes all clients use the same number of interrupt
> +vectors.
> +
> +A standalone client is in contrib/ivshmem-client/.  It can be useful
> +for debugging.
> +
> +=== The ivshmem Client-Server Protocol ===
> +
> +An ivshmem device configured for interrupts connects to an ivshmem
> +server.  This section details the protocol between the two.
> +
> +The connection is one-way: the server sends messages to the client.
> +Each message consists of a single 8 byte little-endian signed number,
> +and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
> +client and server close the connection on error.
> +
> +On connect, the server sends the following messages in order:
> +
> +1. The protocol version number, currently zero.  The client should
> +   close the connection on receipt of versions it can't handle.
> +
> +2. The client's ID.  This is unique among all clients of this server.
> +   IDs must be between 0 and 65535, because the Doorbell register
> +   provides only 16 bits for them.
> +
> +3. The number -1, accompanied by the file descriptor for the shared
> +   memory.
> +
> +4. Connect notifications for existing other clients, if any.  This is
> +   a peer ID (number between 0 and 65535 other than the client's ID),
> +   repeated N times.  Each repetition is accompanied by one file
> +   descriptor.  These are for interrupting the peer with that ID using
> +   vector 0,..,N-1, in order.  If the client is configured for fewer
> +   vectors, it closes the extra file descriptors.  If it is configured
> +   for more, the extra vectors remain unconnected.
> +
> +5. Interrupt setup.  This is the client's own ID, repeated N times.
> +   Each repetition is accompanied by one file descriptor.  These are
> +   for receiving interrupts from peers using vector 0,..,N-1, in
> +   order.  If the client is configured for fewer vectors, it closes
> +   the extra file descriptors.  If it is configured for more, the
> +   extra vectors remain unconnected.
> +
> +From then on, the server sends these kinds of messages:
> +
> +6. Connection / disconnection notification.  This is a peer ID.
> +
> +  - If the number comes with a file descriptor, it's a connection
> +    notification, exactly like in step 4.
> +
> +  - Else, it's a disconnection notification for the peer with that ID.
> +
> +Known bugs:
> +
> +* The protocol changed incompatibly in QEMU 2.5.  Before, messages
> +  were native endian long, and there was no version number.
> +
> +* The protocol is poorly designed.
> +
> +=== The ivshmem Client-Client Protocol ===
> +
> +An ivshmem device configured for interrupts receives eventfd file
> +descriptors for interrupting peers and getting interrupted by peers
> +from the server, as explained in the previous section.
> +
> +To interrupt a peer, the device writes the 8-byte integer 1 in native
> +byte order to the respective file descriptor.
> +
> +To receive an interrupt, the device reads and discards as many 8-byte
> +integers as it can.
> diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
> deleted file mode 100644
> index d318d65..0000000
> --- a/docs/specs/ivshmem_device_spec.txt
> +++ /dev/null
> @@ -1,161 +0,0 @@
> -
> -Device Specification for Inter-VM shared memory device
> -------------------------------------------------------
> -
> -The Inter-VM shared memory device is designed to share a memory region (created
> -on the host via the POSIX shared memory API) between multiple QEMU processes
> -running different guests. In order for all guests to be able to pick up the
> -shared memory area, it is modeled by QEMU as a PCI device exposing said memory
> -to the guest as a PCI BAR.
> -The memory region does not belong to any guest, but is a POSIX memory object on
> -the host. The host can access this shared memory if needed.
> -
> -The device also provides an optional communication mechanism between guests
> -sharing the same memory object. More details about that in the section 'Guest to
> -guest communication' section.
> -
> -
> -The Inter-VM PCI device
> ------------------------
> -
> -From the VM point of view, the ivshmem PCI device supports three BARs.
> -
> -- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
> -  not used.
> -- BAR1 is used for MSI-X when it is enabled in the device.
> -- BAR2 is used to access the shared memory object.
> -
> -It is your choice how to use the device but you must choose between two
> -behaviors :
> -
> -- basically, if you only need the shared memory part, you will map BAR2.
> -  This way, you have access to the shared memory in guest and can use it as you
> -  see fit (memnic, for example, uses it in userland
> -  http://dpdk.org/browse/memnic).
> -
> -- BAR0 and BAR1 are used to implement an optional communication mechanism
> -  through interrupts in the guests. If you need an event mechanism between the
> -  guests accessing the shared memory, you will most likely want to write a
> -  kernel driver that will handle interrupts. See details in the section 'Guest
> -  to guest communication' section.
> -
> -The behavior is chosen when starting your QEMU processes:
> -- no communication mechanism needed, the first QEMU to start creates the shared
> -  memory on the host, subsequent QEMU processes will use it.
> -
> -- communication mechanism needed, an ivshmem server must be started before any
> -  QEMU processes, then each QEMU process connects to the server unix socket.
> -
> -For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
> -
> -
> -Guest to guest communication
> -----------------------------
> -
> -This section details the communication mechanism between the guests accessing
> -the ivhsmem shared memory.
> -
> -*ivshmem server*
> -
> -This server code is available in qemu.git/contrib/ivshmem-server.
> -
> -The server must be started on the host before any guest.
> -It creates a shared memory object then waits for clients to connect on a unix
> -socket. All the messages are little-endian int64_t integer.
> -
> -For each client (QEMU process) that connects to the server:
> -- the server sends a protocol version, if client does not support it, the client
> -  closes the communication,
> -- the server assigns an ID for this client and sends this ID to him as the first
> -  message,
> -- the server sends a fd to the shared memory object to this client,
> -- the server creates a new set of host eventfds associated to the new client and
> -  sends this set to all already connected clients,
> -- finally, the server sends all the eventfds sets for all clients to the new
> -  client.
> -
> -The server signals all clients when one of them disconnects.
> -
> -The client IDs are limited to 16 bits because of the current implementation (see
> -Doorbell register in 'PCI device registers' subsection). Hence only 65536
> -clients are supported.
> -
> -All the file descriptors (fd to the shared memory, eventfds for each client)
> -are passed to clients using SCM_RIGHTS over the server unix socket.
> -
> -Apart from the current ivshmem implementation in QEMU, an ivshmem client has
> -been provided in qemu.git/contrib/ivshmem-client for debug.
> -
> -*QEMU as an ivshmem client*
> -
> -At initialisation, when creating the ivshmem device, QEMU first receives a
> -protocol version and closes communication with server if it does not match.
> -Then, QEMU gets its ID from the server then makes it available through BAR0
> -IVPosition register for the VM to use (see 'PCI device registers' subsection).
> -QEMU then uses the fd to the shared memory to map it to BAR2.
> -eventfds for all other clients received from the server are stored to implement
> -BAR0 Doorbell register (see 'PCI device registers' subsection).
> -Finally, eventfds assigned to this QEMU process are used to send interrupts in
> -this VM.
> -
> -*PCI device registers*
> -
> -From the VM point of view, the ivshmem PCI device supports 4 registers of
> -32-bits each.
> -
> -enum ivshmem_registers {
> -    IntrMask = 0,
> -    IntrStatus = 4,
> -    IVPosition = 8,
> -    Doorbell = 12
> -};
> -
> -The first two registers are the interrupt mask and status registers.  Mask and
> -status are only used with pin-based interrupts.  They are unused with MSI
> -interrupts.
> -
> -Status Register: The status register is set to 1 when an interrupt occurs.
> -
> -Mask Register: The mask register is bitwise ANDed with the interrupt status
> -and the result will raise an interrupt if it is non-zero.  However, since 1 is
> -the only value the status will be set to, it is only the first bit of the mask
> -that has any effect.  Therefore interrupts can be masked by setting the first
> -bit to 0 and unmasked by setting the first bit to 1.
> -
> -IVPosition Register: The IVPosition register is read-only and reports the
> -guest's ID number.  The guest IDs are non-negative integers.  When using the
> -server, since the server is a separate process, the VM ID will only be set when
> -the device is ready (shared memory is received from the server and accessible
> -via the device).  If the device is not ready, the IVPosition will return -1.
> -Applications should ensure that they have a valid VM ID before accessing the
> -shared memory.
> -
> -Doorbell Register:  To interrupt another guest, a guest must write to the
> -Doorbell register.  The doorbell register is 32-bits, logically divided into
> -two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
> -16-bits are the interrupt vector to trigger.  The semantics of the value
> -written to the doorbell depends on whether the device is using MSI or a regular
> -pin-based interrupt.  In short, MSI uses vectors while regular interrupts set
> -the status register.
> -
> -Regular Interrupts
> -
> -If regular interrupts are used (due to either a guest not supporting MSI or the
> -user specifying not to use them on startup) then the value written to the lower
> -16-bits of the Doorbell register results is arbitrary and will trigger an
> -interrupt in the destination guest.
> -
> -Message Signalled Interrupts
> -
> -An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
> -written to the Doorbell register must be between 0 and the maximum number of
> -vectors the guest supports.  The lower 16 bits written to the doorbell is the
> -MSI vector that will be raised in the destination guest.  The number of MSI
> -vectors is configurable but it is set when the VM is started.
> -
> -The important thing to remember with MSI is that it is only a signal, no status
> -is set (since MSI interrupts are not shared).  All information other than the
> -interrupt itself should be communicated via the shared memory region.  Devices
> -supporting multiple MSI vectors can use different vectors to indicate different
> -events have occurred.  The semantics of interrupt vectors are left to the
> -user's discretion.
> --
> 2.4.3
>
>
Eric Blake March 1, 2016, 3:46 p.m. UTC | #2
On 02/29/2016 11:40 AM, Markus Armbruster wrote:
> This started as an attempt to update ivshmem_device_spec.txt for
> clarity, accuracy and completeness while working on its code, and
> quickly became a full rewrite.  Since the diff would be useless
> anyway, I'm using the opportunity to rename the file to
> ivshmem-spec.txt.
> 
> I tried hard to ensure the new text contradicts neither the old text
> nor the code.  If the new text contradicts the old text but not the
> code, it's probably a bug in the old text.  If the new text
> contradicts both, its probably a bug in the new text.
> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---

> +If the server terminates without sending disconnect notifications for
> +its connected clients, the clients can elect to continue.  They can
> +communicate with each other normally, but won't receive disconnect
> +notification on disconnect, and no new clients can connect.  There is
> +no way for the clients to connect to a restarted the server.  The

s/the server/server/

> +device is not capable to tell guest software whether the server is
> +still up.

Wow - lots of shortcomings in the server protocol.  Food for thought for
future improvements, but I'm happy with your approach of just
documenting pitfalls for now.

> +
> +Known bugs:
> +
> +* The protocol changed incompatibly in QEMU 2.5.  Before, messages
> +  were native endian long, and there was no version number.
> +
> +* The protocol is poorly designed.
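
To make the client-server message sequence from the patch easier to
follow, here is a rough sketch of how a client might classify the
server's messages.  It is an illustration under stated assumptions,
not the QEMU implementation: decode_handshake() and its event names
are made up, and file descriptor passing via SCM_RIGHTS is reduced to
a boolean "has_fd" flag per message.

```python
import struct

# Each message is a single 8-byte little-endian signed number.
MSG = struct.Struct("<q")


def _value(payload: bytes) -> int:
    return MSG.unpack(payload)[0]


def decode_handshake(messages):
    """Classify server messages given as (payload_bytes, has_fd) pairs.

    Hypothetical helper: has_fd stands in for "a file descriptor
    accompanied this message"; real fd passing is out of scope.
    Returns (own_id, events), where each event is (kind, peer_id, vector).
    """
    it = iter(messages)

    # 1. Protocol version, currently zero.
    if _value(next(it)[0]) != 0:
        raise ValueError("unsupported protocol version")

    # 2. The client's own ID.
    own_id = _value(next(it)[0])
    if not 0 <= own_id <= 65535:
        raise ValueError("ID out of range")

    # 3. The number -1, accompanied by the shared memory fd.
    payload, has_fd = next(it)
    if _value(payload) != -1 or not has_fd:
        raise ValueError("expected -1 with shared memory fd")

    # 4.-6. Peer connects, interrupt setup, later connects/disconnects.
    next_vector = {}  # per-ID count of fds seen so far (vector 0..N-1)
    events = []
    for payload, has_fd in it:
        peer = _value(payload)
        if not 0 <= peer <= 65535:
            raise ValueError("ID out of range")
        if not has_fd:
            next_vector.pop(peer, None)
            events.append(("disconnect", peer, None))
            continue
        vector = next_vector.get(peer, 0)
        next_vector[peer] = vector + 1
        kind = "interrupt-setup" if peer == own_id else "peer-connect"
        events.append((kind, peer, vector))
    return own_id, events
```

The only thing that distinguishes a connect notification from a
disconnect notification is whether a file descriptor came along,
which is why the sketch has to carry has_fd next to every payload.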
Markus Armbruster March 2, 2016, 9:50 a.m. UTC | #3
Eric Blake <eblake@redhat.com> writes:

> On 02/29/2016 11:40 AM, Markus Armbruster wrote:
>> This started as an attempt to update ivshmem_device_spec.txt for
>> clarity, accuracy and completeness while working on its code, and
>> quickly became a full rewrite.  Since the diff would be useless
>> anyway, I'm using the opportunity to rename the file to
>> ivshmem-spec.txt.
>> 
>> I tried hard to ensure the new text contradicts neither the old text
>> nor the code.  If the new text contradicts the old text but not the
>> code, it's probably a bug in the old text.  If the new text
>> contradicts both, its probably a bug in the new text.
>> 
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>
>> +If the server terminates without sending disconnect notifications for
>> +its connected clients, the clients can elect to continue.  They can
>> +communicate with each other normally, but won't receive disconnect
>> +notification on disconnect, and no new clients can connect.  There is
>> +no way for the clients to connect to a restarted the server.  The
>
> s/the server/server/

Will fix, thanks!

>> +device is not capable to tell guest software whether the server is
>> +still up.
>
> Wow - lots of shortcomings in the server protocol.  Food for thought for
> future improvements, but I'm happy with your approach of just
> documenting pitfalls for now.

Best we can do for 2.6 anyway :)

>> +
>> +Known bugs:
>> +
>> +* The protocol changed incompatibly in QEMU 2.5.  Before, messages
>> +  were native endian long, and there was no version number.
>> +
>> +* The protocol is poorly designed.
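
The client-client protocol at the end of the patch (write the 8-byte
integer 1 in native byte order to interrupt, read and discard to
receive) can be sketched like this.  Note the loud assumption: the
demo uses a non-blocking pipe as a stand-in for an eventfd so that it
runs without eventfd support, and notify(), drain() and demo() are
hypothetical names.

```python
import os
import struct

# The 8-byte integer 1 in native byte order.
ONE = struct.pack("=Q", 1)


def notify(fd: int) -> None:
    """Interrupt a peer: write the 8-byte integer 1 to its descriptor."""
    os.write(fd, ONE)


def drain(fd: int) -> int:
    """Read and discard 8-byte integers until none are left.

    fd must be non-blocking.  Returns the number of successful reads.
    """
    reads = 0
    while True:
        try:
            chunk = os.read(fd, 8)
        except BlockingIOError:
            break  # nothing left to read right now
        if not chunk:
            break  # writer side closed (pipe stand-in only)
        reads += 1
    return reads


def demo() -> int:
    # Stand-in for an eventfd: a pipe with a non-blocking read end.
    r, w = os.pipe()
    os.set_blocking(r, False)
    try:
        notify(w)
        notify(w)
        return drain(r)
    finally:
        os.close(r)
        os.close(w)
```

With the pipe stand-in each notification is read back separately.  A
real eventfd instead accumulates writes into one counter, so a single
read would normally consume all pending notifications at once.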

Patch

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
new file mode 100644
index 0000000..0835ba1
--- /dev/null
+++ b/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,244 @@ 
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host.  In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory: -device ivshmem,shm=NAME,...
+
+  This uses shared memory object NAME.
+
+- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
+
+  An ivshmem server must already be running on the host.  The device
+  connects to the server's UNIX domain socket via character device
+  CHR.
+
+  Each peer gets assigned a unique ID by the server.  IDs must be
+  between 0 and 65535.
+
+  Interrupts are message-signaled by default (MSI-X).  With msi=off
+  the device has no MSI-X capability, and uses legacy INTx instead.
+  vectors=N configures the number of vectors to use.
+
+For more details on ivshmem device properties, see The QEMU Emulator
+User Documentation (qemu-doc.*).
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only when using MSI-X)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices.  This way,
+  you have access to the shared memory in the guest and can use it as
+  you see fit.  Memnic, for example, uses ivshmem this way from guest
+  user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
+  likely want to write a kernel driver to handle interrupts.  This
+  requires the device to be configured for interrupts.
+
+If the device is configured for interrupts, BAR2 is initially invalid.
+It becomes safely accessible only after the ivshmem server has provided
+the shared memory.  Guest software should wait for the IVPosition
+register (described below) to become non-negative before accessing
+BAR2.
+
+The device cannot tell guest software whether it is
+configured for interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+    Offset  Size  Access      On reset  Function
+        0     4   read/write        0   Interrupt Mask
+                                        bit 0: peer interrupt
+                                        bit 1..31: reserved
+        4     4   read/write        0   Interrupt Status
+                                        bit 0: peer interrupt
+                                        bit 1..31: reserved
+        8     4   read-only   0 or -1   IVPosition
+       12     4   write-only      N/A   Doorbell
+                                        bit 0..15: vector
+                                        bit 16..31: peer ID
+       16   240   none            N/A   reserved
+
+Software should only access the registers as specified in column
+"Access".  Reserved bits should be ignored on read, and preserved on
+write.
+
+The Interrupt Status and Interrupt Mask registers together control the
+legacy INTx interrupt.  They take effect only when the device has no
+MSI-X capability: INTx is then asserted while the bit-wise AND of
+Status and Mask is non-zero.  Interrupt Status bit 0 becomes 1 when an
+interrupt request from a peer is received.  Reading the Interrupt
+Status register clears it.
+
+IVPosition Register: if the device is not configured for interrupts,
+this is zero.  Else, it's -1 for a short while after reset, then
+changes to the device's ID (between 0 and 65535).
+
+There is no good way for software to find out whether the device is
+configured for interrupts.  A positive IVPosition means interrupts,
+but zero could be either.  The initial -1 cannot be reliably observed.
+
+Doorbell Register: writing this register requests to interrupt a peer.
+The written value's high 16 bits are the ID of the peer to interrupt,
+and its low 16 bits select an interrupt vector.
+
+If the device is not configured for interrupts, the write is ignored.
+
+If interrupt setup hasn't completed, the write is ignored.  The
+device cannot tell guest software whether setup is complete.
+Interrupts can regress to this state on migration.
+
+If the peer with the requested ID isn't connected, or it has too few
+interrupt vectors connected for the requested vector, the write is
+ignored.  The device cannot tell guest software which peers are
+connected, or how many interrupt vectors are connected.
+
+If the peer doesn't use MSI-X, its Interrupt Status register is set to
+1.  This asserts INTx unless masked by the Interrupt Mask register.
+The device cannot communicate the interrupt vector to guest software
+in this case.
+
+If the peer uses MSI-X, the interrupt for this vector becomes pending.
+There is no way for software to clear the pending bit, and a polling
+mode of operation is therefore impossible with MSI-X.
+
+With multiple MSI-X vectors, different vectors can be used to indicate
+different events have occurred.  The semantics of interrupt vectors
+are left to the application.
+
+
+== Interrupt infrastructure ==
+
+When configured for interrupts, the peers share eventfd objects in
+addition to shared memory.  The shared resources are managed by an
+ivshmem server.
+
+=== The ivshmem server ===
+
+The server listens on a UNIX domain socket.
+
+For each new client that connects to the server, the server
+- picks an ID,
+- creates eventfd file descriptors for the interrupt vectors,
+- sends the ID and the file descriptor for the shared memory to the
+  new client,
+- sends connect notifications for the new client to the other clients
+  (these contain file descriptors for sending interrupts),
+- sends connect notifications for the other clients to the new client,
+  and
+- sends interrupt setup messages to the new client (these contain file
+  descriptors for receiving interrupts).
+
+When a client disconnects from the server, the server sends disconnect
+notifications to the other clients.
+
+The next section describes the protocol in detail.
+
+If the server terminates without sending disconnect notifications for
+its connected clients, the clients can elect to continue.  They can
+communicate with each other normally, but won't receive disconnect
+notification on disconnect, and no new clients can connect.  There is
+no way for the clients to connect to a restarted server.  The
+device cannot tell guest software whether the server is
+still up.
+
+Example server code is in contrib/ivshmem-server/.  Not to be used in
+production.  It assumes all clients use the same number of interrupt
+vectors.
+
+A standalone client is in contrib/ivshmem-client/.  It can be useful
+for debugging.
+
+=== The ivshmem Client-Server Protocol ===
+
+An ivshmem device configured for interrupts connects to an ivshmem
+server.  This section details the protocol between the two.
+
+The connection is one-way: the server sends messages to the client.
+Each message consists of a single 8 byte little-endian signed number,
+and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
+client and server close the connection on error.
+
+On connect, the server sends the following messages in order:
+
+1. The protocol version number, currently zero.  The client should
+   close the connection on receipt of versions it can't handle.
+
+2. The client's ID.  This is unique among all clients of this server.
+   IDs must be between 0 and 65535, because the Doorbell register
+   provides only 16 bits for them.
+
+3. The number -1, accompanied by the file descriptor for the shared
+   memory.
+
+4. Connect notifications for existing other clients, if any.  This is
+   a peer ID (number between 0 and 65535 other than the client's ID),
+   repeated N times.  Each repetition is accompanied by one file
+   descriptor.  These are for interrupting the peer with that ID using
+   vector 0,..,N-1, in order.  If the client is configured for fewer
+   vectors, it closes the extra file descriptors.  If it is configured
+   for more, the extra vectors remain unconnected.
+
+5. Interrupt setup.  This is the client's own ID, repeated N times.
+   Each repetition is accompanied by one file descriptor.  These are
+   for receiving interrupts from peers using vector 0,..,N-1, in
+   order.  If the client is configured for fewer vectors, it closes
+   the extra file descriptors.  If it is configured for more, the
+   extra vectors remain unconnected.
+
+From then on, the server sends these kinds of messages:
+
+6. Connection / disconnection notification.  This is a peer ID.
+
+  - If the number comes with a file descriptor, it's a connection
+    notification, exactly like in step 4.
+
+  - Else, it's a disconnection notification for the peer with that ID.
+
+Known bugs:
+
+* The protocol changed incompatibly in QEMU 2.5.  Before, messages
+  were native endian long, and there was no version number.
+
+* The protocol is poorly designed.
+
+=== The ivshmem Client-Client Protocol ===
+
+An ivshmem device configured for interrupts receives eventfd file
+descriptors for interrupting peers and getting interrupted by peers
+from the server, as explained in the previous section.
+
+To interrupt a peer, the device writes the 8-byte integer 1 in native
+byte order to the respective file descriptor.
+
+To receive an interrupt, the device reads and discards as many 8-byte
+integers as it can.
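
The message sequence in "The ivshmem Client-Server Protocol" can be
sketched as a small state machine.  This is an illustration only, not
QEMU's implementation: the names decode_number, ClientState, feed and
surplus are hypothetical, file descriptors are treated as opaque
values, and the socket I/O (reading 8 bytes plus an optional
SCM_RIGHTS descriptor) is assumed to happen elsewhere.

```python
import struct

MAX_ID = 65535  # IDs must fit the 16 bits the Doorbell register provides


def decode_number(payload):
    """Decode one message body: a single 8-byte little-endian signed number."""
    if len(payload) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack("<q", payload)[0]


class ClientState:
    """Hypothetical client-side bookkeeping for the server's messages."""

    def __init__(self, n_vectors):
        self.n_vectors = n_vectors
        self.step = 0      # position within the three fixed opening messages
        self.own_id = None
        self.shm_fd = None
        self.rx_fds = []   # fds for receiving interrupts, indexed by vector
        self.peers = {}    # peer ID -> fds for interrupting it, by vector
        self.surplus = []  # fds beyond n_vectors; a real client closes them

    def feed(self, number, fd=None):
        """Process one message; fd is the accompanying descriptor or None."""
        if self.step == 0:                    # 1. protocol version
            if number != 0 or fd is not None:
                raise ValueError("unsupported protocol version")
        elif self.step == 1:                  # 2. our own ID
            if not 0 <= number <= MAX_ID or fd is not None:
                raise ValueError("bad client ID")
            self.own_id = number
        elif self.step == 2:                  # 3. -1 plus the shared memory fd
            if number != -1 or fd is None:
                raise ValueError("expected shared memory fd")
            self.shm_fd = fd
        elif not 0 <= number <= MAX_ID:
            raise ValueError("bad peer ID")
        elif fd is None:                      # 6. disconnect notification
            # A real client would also close the fds it drops here.
            self.peers.pop(number, None)
        else:
            # Own ID: interrupt setup (5).  Other ID: connect notification (4, 6).
            if number == self.own_id:
                fds = self.rx_fds
            else:
                fds = self.peers.setdefault(number, [])
            # Vectors arrive in order 0..N-1; extras go to the surplus list.
            (fds if len(fds) < self.n_vectors else self.surplus).append(fd)
        self.step = min(self.step + 1, 3)
```

A client configured for two vectors that is fed a peer's three
descriptors keeps the first two and records the third as surplus,
matching the "fewer vectors" rule in steps 4 and 5.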
diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
deleted file mode 100644
index d318d65..0000000
--- a/docs/specs/ivshmem_device_spec.txt
+++ /dev/null
@@ -1,161 +0,0 @@ 
-
-Device Specification for Inter-VM shared memory device
-------------------------------------------------------
-
-The Inter-VM shared memory device is designed to share a memory region (created
-on the host via the POSIX shared memory API) between multiple QEMU processes
-running different guests. In order for all guests to be able to pick up the
-shared memory area, it is modeled by QEMU as a PCI device exposing said memory
-to the guest as a PCI BAR.
-The memory region does not belong to any guest, but is a POSIX memory object on
-the host. The host can access this shared memory if needed.
-
-The device also provides an optional communication mechanism between guests
-sharing the same memory object. More details about that in the section 'Guest to
-guest communication' section.
-
-
-The Inter-VM PCI device
------------------------
-
-From the VM point of view, the ivshmem PCI device supports three BARs.
-
-- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
-  not used.
-- BAR1 is used for MSI-X when it is enabled in the device.
-- BAR2 is used to access the shared memory object.
-
-It is your choice how to use the device but you must choose between two
-behaviors :
-
-- basically, if you only need the shared memory part, you will map BAR2.
-  This way, you have access to the shared memory in guest and can use it as you
-  see fit (memnic, for example, uses it in userland
-  http://dpdk.org/browse/memnic).
-
-- BAR0 and BAR1 are used to implement an optional communication mechanism
-  through interrupts in the guests. If you need an event mechanism between the
-  guests accessing the shared memory, you will most likely want to write a
-  kernel driver that will handle interrupts. See details in the section 'Guest
-  to guest communication' section.
-
-The behavior is chosen when starting your QEMU processes:
-- no communication mechanism needed, the first QEMU to start creates the shared
-  memory on the host, subsequent QEMU processes will use it.
-
-- communication mechanism needed, an ivshmem server must be started before any
-  QEMU processes, then each QEMU process connects to the server unix socket.
-
-For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
-
-
-Guest to guest communication
-----------------------------
-
-This section details the communication mechanism between the guests accessing
-the ivhsmem shared memory.
-
-*ivshmem server*
-
-This server code is available in qemu.git/contrib/ivshmem-server.
-
-The server must be started on the host before any guest.
-It creates a shared memory object then waits for clients to connect on a unix
-socket. All the messages are little-endian int64_t integer.
-
-For each client (QEMU process) that connects to the server:
-- the server sends a protocol version, if client does not support it, the client
-  closes the communication,
-- the server assigns an ID for this client and sends this ID to him as the first
-  message,
-- the server sends a fd to the shared memory object to this client,
-- the server creates a new set of host eventfds associated to the new client and
-  sends this set to all already connected clients,
-- finally, the server sends all the eventfds sets for all clients to the new
-  client.
-
-The server signals all clients when one of them disconnects.
-
-The client IDs are limited to 16 bits because of the current implementation (see
-Doorbell register in 'PCI device registers' subsection). Hence only 65536
-clients are supported.
-
-All the file descriptors (fd to the shared memory, eventfds for each client)
-are passed to clients using SCM_RIGHTS over the server unix socket.
-
-Apart from the current ivshmem implementation in QEMU, an ivshmem client has
-been provided in qemu.git/contrib/ivshmem-client for debug.
-
-*QEMU as an ivshmem client*
-
-At initialisation, when creating the ivshmem device, QEMU first receives a
-protocol version and closes communication with server if it does not match.
-Then, QEMU gets its ID from the server then makes it available through BAR0
-IVPosition register for the VM to use (see 'PCI device registers' subsection).
-QEMU then uses the fd to the shared memory to map it to BAR2.
-eventfds for all other clients received from the server are stored to implement
-BAR0 Doorbell register (see 'PCI device registers' subsection).
-Finally, eventfds assigned to this QEMU process are used to send interrupts in
-this VM.
-
-*PCI device registers*
-
-From the VM point of view, the ivshmem PCI device supports 4 registers of
-32-bits each.
-
-enum ivshmem_registers {
-    IntrMask = 0,
-    IntrStatus = 4,
-    IVPosition = 8,
-    Doorbell = 12
-};
-
-The first two registers are the interrupt mask and status registers.  Mask and
-status are only used with pin-based interrupts.  They are unused with MSI
-interrupts.
-
-Status Register: The status register is set to 1 when an interrupt occurs.
-
-Mask Register: The mask register is bitwise ANDed with the interrupt status
-and the result will raise an interrupt if it is non-zero.  However, since 1 is
-the only value the status will be set to, it is only the first bit of the mask
-that has any effect.  Therefore interrupts can be masked by setting the first
-bit to 0 and unmasked by setting the first bit to 1.
-
-IVPosition Register: The IVPosition register is read-only and reports the
-guest's ID number.  The guest IDs are non-negative integers.  When using the
-server, since the server is a separate process, the VM ID will only be set when
-the device is ready (shared memory is received from the server and accessible
-via the device).  If the device is not ready, the IVPosition will return -1.
-Applications should ensure that they have a valid VM ID before accessing the
-shared memory.
-
-Doorbell Register:  To interrupt another guest, a guest must write to the
-Doorbell register.  The doorbell register is 32-bits, logically divided into
-two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
-16-bits are the interrupt vector to trigger.  The semantics of the value
-written to the doorbell depends on whether the device is using MSI or a regular
-pin-based interrupt.  In short, MSI uses vectors while regular interrupts set
-the status register.
-
-Regular Interrupts
-
-If regular interrupts are used (due to either a guest not supporting MSI or the
-user specifying not to use them on startup) then the value written to the lower
-16-bits of the Doorbell register results is arbitrary and will trigger an
-interrupt in the destination guest.
-
-Message Signalled Interrupts
-
-An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
-written to the Doorbell register must be between 0 and the maximum number of
-vectors the guest supports.  The lower 16 bits written to the doorbell is the
-MSI vector that will be raised in the destination guest.  The number of MSI
-vectors is configurable but it is set when the VM is started.
-
-The important thing to remember with MSI is that it is only a signal, no status
-is set (since MSI interrupts are not shared).  All information other than the
-interrupt itself should be communicated via the shared memory region.  Devices
-supporting multiple MSI vectors can use different vectors to indicate different
-events have occurred.  The semantics of interrupt vectors are left to the
-user's discretion.
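
The Doorbell layout described in the old text (high 16 bits are the
guest ID to interrupt, low 16 bits are the vector) can be sketched as
follows.  The helper name doorbell_value is hypothetical, and the
sketch only computes the 32-bit value; writing it to the register at
BAR0 offset 12 is left to the guest driver.

```python
def doorbell_value(peer_id, vector):
    """Build the 32-bit value a guest writes to the Doorbell register."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    # High half selects the destination guest, low half selects the vector.
    return (peer_id << 16) | vector
```

For example, interrupting peer 3 on vector 1 yields 0x00030001.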