diff mbox series

[v8,10/10] Documentation: Add documentation for VDUSE

Message ID 20210615141331.407-11-xieyongji@bytedance.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Introduce VDUSE - vDPA Device in Userspace | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Yongji Xie June 15, 2021, 2:13 p.m. UTC
VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++
 2 files changed, 223 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

Comments

Stefan Hajnoczi June 24, 2021, 1:01 p.m. UTC | #1
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst
> 
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..c432be070f67 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -27,6 +27,7 @@ place where this information is gathered.
>     iommu
>     media/index
>     sysfs-platform_profile
> +   vduse
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index 000000000000..2f9cd1a4e530
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,222 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace. And
> +to make it simple, the emulated vDPA device's control path is handled in
> +the kernel and only the data path is implemented in the userspace.
> +
> +Note that only virtio block device is supported by VDUSE framework now,
> +which can reduce security risks when the userspace process that implements
> +the data path is run by an unprivileged user. The Support for other device
> +types can be added after the security issue is clarified or fixed in the future.
> +
> +Start/Stop VDUSE devices
> +------------------------
> +
> +VDUSE devices are started as follows:
> +
> +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +   /dev/vduse/control.
> +
> +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> +
> +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> +   instance to vDPA bus.
> +
> +VDUSE devices are stopped as follows:
> +
> +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> +   instance from vDPA bus.
> +
> +2. Close the file descriptor referring to /dev/vduse/$NAME
> +
> +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> +   /dev/vduse/control
> +
> +The netlink messages metioned above can be sent via vdpa tool in iproute2
> +or use the below sample codes:
> +
> +.. code-block:: c
> +
> +	static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> +	{
> +		struct nl_sock *nlsock;
> +		struct nl_msg *msg;
> +		int famid;
> +
> +		nlsock = nl_socket_alloc();
> +		if (!nlsock)
> +			return -ENOMEM;
> +
> +		if (genl_connect(nlsock))
> +			goto free_sock;
> +
> +		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> +		if (famid < 0)
> +			goto close_sock;
> +
> +		msg = nlmsg_alloc();
> +		if (!msg)
> +			goto close_sock;
> +
> +		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
> +			goto nla_put_failure;
> +
> +		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> +		if (cmd == VDPA_CMD_DEV_NEW)
> +			NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> +
> +		if (nl_send_sync(nlsock, msg))
> +			goto close_sock;
> +
> +		nl_close(nlsock);
> +		nl_socket_free(nlsock);
> +
> +		return 0;
> +	nla_put_failure:
> +		nlmsg_free(msg);
> +	close_sock:
> +		nl_close(nlsock);
> +	free_sock:
> +		nl_socket_free(nlsock);
> +		return -1;
> +	}
> +
> +How VDUSE works
> +---------------
> +
> +Since the emuldated vDPA device's control path is handled in the kernel,

s/emuldated/emulated/

> +a message-based communication protocol and few types of control messages
> +are introduced by VDUSE framework to make userspace be aware of the data
> +path related changes:
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> +
> +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> +
> +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
> +those control messages from/to VDUSE kernel module as follows:
> +
> +.. code-block:: c
> +
> +	static int vduse_message_handler(int dev_fd)
> +	{
> +		int len;
> +		struct vduse_dev_request req;
> +		struct vduse_dev_response resp;
> +
> +		len = read(dev_fd, &req, sizeof(req));
> +		if (len != sizeof(req))
> +			return -1;
> +
> +		resp.request_id = req.request_id;
> +
> +		switch (req.type) {
> +
> +		/* handle different types of message */
> +
> +		}
> +
> +		if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +			return 0;
> +
> +		len = write(dev_fd, &resp, sizeof(resp));
> +		if (len != sizeof(resp))
> +			return -1;
> +
> +		return 0;
> +	}
> +
> +After VDUSE_START_DATAPLANE messages is received, userspace should start the
> +dataplane processing with the help of some ioctls on /dev/vduse/$NAME:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
> +  Userspace can access this iova region by passing fd and corresponding size, offset,
> +  perm to mmap(). For example:
> +
> +.. code-block:: c
> +
> +	static int perm_to_prot(uint8_t perm)
> +	{
> +		int prot = 0;
> +
> +		switch (perm) {
> +		case VDUSE_ACCESS_WO:
> +			prot |= PROT_WRITE;
> +			break;
> +		case VDUSE_ACCESS_RO:
> +			prot |= PROT_READ;
> +			break;
> +		case VDUSE_ACCESS_RW:
> +			prot |= PROT_READ | PROT_WRITE;
> +			break;
> +		}
> +
> +		return prot;
> +	}
> +
> +	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> +	{
> +		int fd;
> +		void *addr;
> +		size_t size;
> +		struct vduse_iotlb_entry entry;
> +
> +		entry.start = iova;
> +		entry.last = iova + 1;

Why +1?

I expected the request to include *len so that VDUSE can create a bounce
buffer for the full iova range, if necessary.

> +		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> +		if (fd < 0)
> +			return NULL;
> +
> +		size = entry.last - entry.start + 1;
> +		*len = entry.last - iova + 1;
> +		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> +			    fd, entry.offset);
> +		close(fd);
> +		if (addr == MAP_FAILED)
> +			return NULL;
> +
> +		/* do something to cache this iova region */

How is userspace expected to manage iotlb mmaps? When should munmap(2)
be called?

Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
guest RAM (e.g. multiple gigabytes) that can be cached permanently or
will it return just enough pages to cover [start, last)?

> +
> +		return addr + iova - entry.start;
> +	}
> +
> +- VDUSE_DEV_GET_FEATURES: Get the negotiated features

Are these VIRTIO feature bits? Please explain how feature negotiation
works. There must be a way for userspace to report the device's
supported feature bits to the kernel.

> +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt

Does this mean the contents of the configuration space are cached by
VDUSE? The downside is that the userspace code cannot generate the
contents on demand. Most devices doin't need to generate the contents
on demand, so I think this is okay but I had expected a different
interface:

kernel->userspace VDUSE_DEV_GET_CONFIG
userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ

I think you can leave it the way it is, but I wanted to mention this in
case someone thinks it's important to support generating the contents of
the configuration space on demand.

> +- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE kernel module to notify userspace to consume the vring.
> +
> +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue

This information is useful but it's not enough to be able to implement a
userspace device. Please provide more developer documentation or at
least refer to uapi header files, published documents, etc that contain
the details.
Yongji Xie June 29, 2021, 5:43 a.m. UTC | #2
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  Documentation/userspace-api/index.rst |   1 +
> >  Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++
> >  2 files changed, 223 insertions(+)
> >  create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> > index 0b5eefed027e..c432be070f67 100644
> > --- a/Documentation/userspace-api/index.rst
> > +++ b/Documentation/userspace-api/index.rst
> > @@ -27,6 +27,7 @@ place where this information is gathered.
> >     iommu
> >     media/index
> >     sysfs-platform_profile
> > +   vduse
> >
> >  .. only::  subproject and html
> >
> > diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> > new file mode 100644
> > index 000000000000..2f9cd1a4e530
> > --- /dev/null
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -0,0 +1,222 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace. And
> > +to make it simple, the emulated vDPA device's control path is handled in
> > +the kernel and only the data path is implemented in the userspace.
> > +
> > +Note that only virtio block device is supported by VDUSE framework now,
> > +which can reduce security risks when the userspace process that implements
> > +the data path is run by an unprivileged user. The Support for other device
> > +types can be added after the security issue is clarified or fixed in the future.
> > +
> > +Start/Stop VDUSE devices
> > +------------------------
> > +
> > +VDUSE devices are started as follows:
> > +
> > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > +   /dev/vduse/control.
> > +
> > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> > +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> > +
> > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> > +   instance to vDPA bus.
> > +
> > +VDUSE devices are stopped as follows:
> > +
> > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> > +   instance from vDPA bus.
> > +
> > +2. Close the file descriptor referring to /dev/vduse/$NAME
> > +
> > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> > +   /dev/vduse/control
> > +
> > +The netlink messages metioned above can be sent via vdpa tool in iproute2
> > +or use the below sample codes:
> > +
> > +.. code-block:: c
> > +
> > +     static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> > +     {
> > +             struct nl_sock *nlsock;
> > +             struct nl_msg *msg;
> > +             int famid;
> > +
> > +             nlsock = nl_socket_alloc();
> > +             if (!nlsock)
> > +                     return -ENOMEM;
> > +
> > +             if (genl_connect(nlsock))
> > +                     goto free_sock;
> > +
> > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > +             if (famid < 0)
> > +                     goto close_sock;
> > +
> > +             msg = nlmsg_alloc();
> > +             if (!msg)
> > +                     goto close_sock;
> > +
> > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
> > +                     goto nla_put_failure;
> > +
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > +             if (cmd == VDPA_CMD_DEV_NEW)
> > +                     NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > +
> > +             if (nl_send_sync(nlsock, msg))
> > +                     goto close_sock;
> > +
> > +             nl_close(nlsock);
> > +             nl_socket_free(nlsock);
> > +
> > +             return 0;
> > +     nla_put_failure:
> > +             nlmsg_free(msg);
> > +     close_sock:
> > +             nl_close(nlsock);
> > +     free_sock:
> > +             nl_socket_free(nlsock);
> > +             return -1;
> > +     }
> > +
> > +How VDUSE works
> > +---------------
> > +
> > +Since the emuldated vDPA device's control path is handled in the kernel,
>
> s/emuldated/emulated/
>

Will fix it.

> > +a message-based communication protocol and few types of control messages
> > +are introduced by VDUSE framework to make userspace be aware of the data
> > +path related changes:
> > +
> > +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> > +
> > +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> > +
> > +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> > +
> > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> > +
> > +Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
> > +those control messages from/to VDUSE kernel module as follows:
> > +
> > +.. code-block:: c
> > +
> > +     static int vduse_message_handler(int dev_fd)
> > +     {
> > +             int len;
> > +             struct vduse_dev_request req;
> > +             struct vduse_dev_response resp;
> > +
> > +             len = read(dev_fd, &req, sizeof(req));
> > +             if (len != sizeof(req))
> > +                     return -1;
> > +
> > +             resp.request_id = req.request_id;
> > +
> > +             switch (req.type) {
> > +
> > +             /* handle different types of message */
> > +
> > +             }
> > +
> > +             if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> > +                     return 0;
> > +
> > +             len = write(dev_fd, &resp, sizeof(resp));
> > +             if (len != sizeof(resp))
> > +                     return -1;
> > +
> > +             return 0;
> > +     }
> > +
> > +After VDUSE_START_DATAPLANE messages is received, userspace should start the
> > +dataplane processing with the help of some ioctls on /dev/vduse/$NAME:
> > +
> > +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
> > +  Userspace can access this iova region by passing fd and corresponding size, offset,
> > +  perm to mmap(). For example:
> > +
> > +.. code-block:: c
> > +
> > +     static int perm_to_prot(uint8_t perm)
> > +     {
> > +             int prot = 0;
> > +
> > +             switch (perm) {
> > +             case VDUSE_ACCESS_WO:
> > +                     prot |= PROT_WRITE;
> > +                     break;
> > +             case VDUSE_ACCESS_RO:
> > +                     prot |= PROT_READ;
> > +                     break;
> > +             case VDUSE_ACCESS_RW:
> > +                     prot |= PROT_READ | PROT_WRITE;
> > +                     break;
> > +             }
> > +
> > +             return prot;
> > +     }
> > +
> > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > +     {
> > +             int fd;
> > +             void *addr;
> > +             size_t size;
> > +             struct vduse_iotlb_entry entry;
> > +
> > +             entry.start = iova;
> > +             entry.last = iova + 1;
>
> Why +1?
>
> I expected the request to include *len so that VDUSE can create a bounce
> buffer for the full iova range, if necessary.
>

The function is used to translate iova to va. And the *len is not
specified by the caller. Instead, it's used to tell the caller the
length of the contiguous iova region from the specified iova. And the
ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
overlapped iova region. So using iova + 1 should be enough here.

> > +             fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> > +             if (fd < 0)
> > +                     return NULL;
> > +
> > +             size = entry.last - entry.start + 1;
> > +             *len = entry.last - iova + 1;
> > +             addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > +                         fd, entry.offset);
> > +             close(fd);
> > +             if (addr == MAP_FAILED)
> > +                     return NULL;
> > +
> > +             /* do something to cache this iova region */
>
> How is userspace expected to manage iotlb mmaps? When should munmap(2)
> be called?
>

The simple way is using a list to store the iotlb mappings. And we
should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
or VDUSE_STOP_DATAPLANE message is received.

> Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> will it return just enough pages to cover [start, last)?
>

It should return one iotlb mapping that covers [start, last). In
vhost-vdpa cases, it might be a full chunk of guest RAM. In
virtio-vdpa cases, it might be the whole bounce buffer or one coherent
mapping (produced by dma_alloc_coherent()).

> > +
> > +             return addr + iova - entry.start;
> > +     }
> > +
> > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
>
> Are these VIRTIO feature bits? Please explain how feature negotiation
> works. There must be a way for userspace to report the device's
> supported feature bits to the kernel.
>

Yes, these are VIRTIO feature bits. Userspace will specify the
device's supported feature bits when creating a new VDUSE device with
ioctl(VDUSE_CREATE_DEV).

> > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
>
> Does this mean the contents of the configuration space are cached by
> VDUSE?

Yes, but the kernel will also store the same contents.

> The downside is that the userspace code cannot generate the
> contents on demand. Most devices doin't need to generate the contents
> on demand, so I think this is okay but I had expected a different
> interface:
>
> kernel->userspace VDUSE_DEV_GET_CONFIG
> userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
>

The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
will need lots of modification of virtio codes to support that. So to
make it simple, we choose this way:

userspace -> kernel VDUSE_DEV_SET_CONFIG
userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ

> I think you can leave it the way it is, but I wanted to mention this in
> case someone thinks it's important to support generating the contents of
> the configuration space on demand.
>

Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?

> > +- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata
> > +
> > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> > +  by VDUSE kernel module to notify userspace to consume the vring.
> > +
> > +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
>
> This information is useful but it's not enough to be able to implement a
> userspace device. Please provide more developer documentation or at
> least refer to uapi header files, published documents, etc that contain
> the details.

OK, I will try to add more details.

Thanks,
Yongji
Stefan Hajnoczi June 30, 2021, 10:06 a.m. UTC | #3
On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > +     {
> > > +             int fd;
> > > +             void *addr;
> > > +             size_t size;
> > > +             struct vduse_iotlb_entry entry;
> > > +
> > > +             entry.start = iova;
> > > +             entry.last = iova + 1;
> >
> > Why +1?
> >
> > I expected the request to include *len so that VDUSE can create a bounce
> > buffer for the full iova range, if necessary.
> >
> 
> The function is used to translate iova to va. And the *len is not
> specified by the caller. Instead, it's used to tell the caller the
> length of the contiguous iova region from the specified iova. And the
> ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> overlapped iova region. So using iova + 1 should be enough here.

Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
wonder why userspace needs to assign a value at all if it's always +1.

> 
> > > +             fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> > > +             if (fd < 0)
> > > +                     return NULL;
> > > +
> > > +             size = entry.last - entry.start + 1;
> > > +             *len = entry.last - iova + 1;
> > > +             addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > +                         fd, entry.offset);
> > > +             close(fd);
> > > +             if (addr == MAP_FAILED)
> > > +                     return NULL;
> > > +
> > > +             /* do something to cache this iova region */
> >
> > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > be called?
> >
> 
> The simple way is using a list to store the iotlb mappings. And we
> should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> or VDUSE_STOP_DATAPLANE message is received.

Thanks for explaining. It would be helpful to have a description of
IOTLB operation in this document.

> > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > will it return just enough pages to cover [start, last)?
> >
> 
> It should return one iotlb mapping that covers [start, last). In
> vhost-vdpa cases, it might be a full chunk of guest RAM. In
> virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> mapping (produced by dma_alloc_coherent()).

Great, thanks. Adding something about this to the documentation would
help others implementing VDUSE devices or libraries.

> > > +
> > > +             return addr + iova - entry.start;
> > > +     }
> > > +
> > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> >
> > Are these VIRTIO feature bits? Please explain how feature negotiation
> > works. There must be a way for userspace to report the device's
> > supported feature bits to the kernel.
> >
> 
> Yes, these are VIRTIO feature bits. Userspace will specify the
> device's supported feature bits when creating a new VDUSE device with
> ioctl(VDUSE_CREATE_DEV).

Can the VDUSE device influence feature bit negotiation? For example, if
the VDUSE virtio-blk device does not implement discard/write-zeroes, how
does QEMU or the guest find out about this?

> > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> >
> > Does this mean the contents of the configuration space are cached by
> > VDUSE?
> 
> Yes, but the kernel will also store the same contents.
> 
> > The downside is that the userspace code cannot generate the
> > contents on demand. Most devices doin't need to generate the contents
> > on demand, so I think this is okay but I had expected a different
> > interface:
> >
> > kernel->userspace VDUSE_DEV_GET_CONFIG
> > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> 
> The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> will need lots of modification of virtio codes to support that. So to
> make it simple, we choose this way:
> 
> userspace -> kernel VDUSE_DEV_SET_CONFIG
> userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> 
> > I think you can leave it the way it is, but I wanted to mention this in
> > case someone thinks it's important to support generating the contents of
> > the configuration space on demand.
> >
> 
> Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?

If the contents of the configuration space change continuously, then the
VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
conditions. For example, imagine a device where the driver can read a
timer from the configuration space. I think the VIRTIO device model
allows that although I'm not aware of any devices that do something like
it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
called frequently to keep the timer value updated even though the guest
driver probably isn't accessing it.

What's worse is that there might be race conditions where other
driver->device operations are supposed to update the configuration space
but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
outdated copy.

Again, I don't think it's a problem for existing devices in the VIRTIO
specification. But I'm not 100% sure and future devices might require
what I've described, so the VDUSE_DEV_SET_CONFIG interface could become
a problem.

Stefan
Yongji Xie July 1, 2021, 10 a.m. UTC | #4
On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > +     {
> > > > +             int fd;
> > > > +             void *addr;
> > > > +             size_t size;
> > > > +             struct vduse_iotlb_entry entry;
> > > > +
> > > > +             entry.start = iova;
> > > > +             entry.last = iova + 1;
> > >
> > > Why +1?
> > >
> > > I expected the request to include *len so that VDUSE can create a bounce
> > > buffer for the full iova range, if necessary.
> > >
> >
> > The function is used to translate iova to va. And the *len is not
> > specified by the caller. Instead, it's used to tell the caller the
> > length of the contiguous iova region from the specified iova. And the
> > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > overlapped iova region. So using iova + 1 should be enough here.
>
> Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> wonder why userspace needs to assign a value at all if it's always +1.
>

If we need to get some iova regions in the specified range, we need
the entry.last field. For example, we can use [0, ULONG_MAX] to get
the first overlapped iova region which might be [4096, 8192]. But in
this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
get the iova region including the specified iova.

> >
> > > > +             fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> > > > +             if (fd < 0)
> > > > +                     return NULL;
> > > > +
> > > > +             size = entry.last - entry.start + 1;
> > > > +             *len = entry.last - iova + 1;
> > > > +             addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > > +                         fd, entry.offset);
> > > > +             close(fd);
> > > > +             if (addr == MAP_FAILED)
> > > > +                     return NULL;
> > > > +
> > > > +             /* do something to cache this iova region */
> > >
> > > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > > be called?
> > >
> >
> > The simple way is using a list to store the iotlb mappings. And we
> > should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> > or VDUSE_STOP_DATAPLANE message is received.
>
> Thanks for explaining. It would be helpful to have a description of
> IOTLB operation in this document.
>

Sure.

> > > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > > will it return just enough pages to cover [start, last)?
> > >
> >
> > It should return one iotlb mapping that covers [start, last). In
> > vhost-vdpa cases, it might be a full chunk of guest RAM. In
> > virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> > mapping (produced by dma_alloc_coherent()).
>
> Great, thanks. Adding something about this to the documentation would
> help others implementing VDUSE devices or libraries.
>

OK.

> > > > +
> > > > +             return addr + iova - entry.start;
> > > > +     }
> > > > +
> > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > >
> > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > works. There must be a way for userspace to report the device's
> > > supported feature bits to the kernel.
> > >
> >
> > Yes, these are VIRTIO feature bits. Userspace will specify the
> > device's supported feature bits when creating a new VDUSE device with
> > ioctl(VDUSE_CREATE_DEV).
>
> Can the VDUSE device influence feature bit negotiation? For example, if
> the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> does QEMU or the guest find out about this?
>

There is a "features" field in struct vduse_dev_config which is used
to do feature negotiation.

> > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> > >
> > > Does this mean the contents of the configuration space are cached by
> > > VDUSE?
> >
> > Yes, but the kernel will also store the same contents.
> >
> > > The downside is that the userspace code cannot generate the
> > > contents on demand. Most devices doin't need to generate the contents
> > > on demand, so I think this is okay but I had expected a different
> > > interface:
> > >
> > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> >
> > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > will need lots of modification of virtio codes to support that. So to
> > make it simple, we choose this way:
> >
> > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> > > I think you can leave it the way it is, but I wanted to mention this in
> > > case someone thinks it's important to support generating the contents of
> > > the configuration space on demand.
> > >
> >
> > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
>
> If the contents of the configuration space change continuously, then the
> VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> conditions. For example, imagine a device where the driver can read a
> timer from the configuration space. I think the VIRTIO device model
> allows that although I'm not aware of any devices that do something like
> it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> called frequently to keep the timer value updated even though the guest
> driver probably isn't accessing it.
>

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

> What's worse is that there might be race conditions where other
> driver->device operations are supposed to update the configuration space
> but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
> outdated copy.
>

I'm not sure. Should the device and driver be able to access the same
fields concurrently?

> Again, I don't think it's a problem for existing devices in the VIRTIO
> specification. But I'm not 100% sure and future devices might require
> what I've described, so the VDUSE_DEV_SET_CONFIG interface could become
> a problem.
>

If so, maybe a new interface can be added at that time. The
VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good
way for failure handling.

Thanks,
Yongji
Stefan Hajnoczi July 1, 2021, 1:15 p.m. UTC | #5
On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > > +     {
> > > > > +             int fd;
> > > > > +             void *addr;
> > > > > +             size_t size;
> > > > > +             struct vduse_iotlb_entry entry;
> > > > > +
> > > > > +             entry.start = iova;
> > > > > +             entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the caller the
> > > length of the contiguous iova region from the specified iova. And the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always +1.
> >
> 
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.

I see, thanks for explaining!

> > > > > +             return addr + iova - entry.start;
> > > > > +     }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > works. There must be a way for userspace to report the device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example, if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > does QEMU or the guest find out about this?
> >
> 
> There is a "features" field in struct vduse_dev_config which is used
> to do feature negotiation.

This approach is more restrictive than required by the VIRTIO
specification:

  "The device SHOULD accept any valid subset of features the driver
  accepts, otherwise it MUST fail to set the FEATURES_OK device status
  bit when the driver writes it."

  https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, if feature B depends on feature A and can only be enabled when
feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.

> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices doin't need to generate the contents
> > > > on demand, so I think this is okay but I had expected a different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > >
> > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > will need lots of modification of virtio codes to support that. So to
> > > make it simple, we choose this way:
> > >
> > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> > > > I think you can leave it the way it is, but I wanted to mention this in
> > > > case someone thinks it's important to support generating the contents of
> > > > the configuration space on demand.
> > > >
> > >
> > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> >
> > If the contents of the configuration space change continuously, then the
> > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> > conditions. For example, imagine a device where the driver can read a
> > timer from the configuration space. I think the VIRTIO device model
> > allows that although I'm not aware of any devices that do something like
> > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> > called frequently to keep the timer value updated even though the guest
> > driver probably isn't accessing it.
> >
> 
> OK, I get you now. Since the VIRTIO specification says "Device
> configuration space is generally used for rarely-changing or
> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

> > What's worse is that there might be race conditions where other
> > driver->device operations are supposed to update the configuration space
> > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
> > outdated copy.
> >
> 
> I'm not sure. Should the device and driver be able to access the same
> fields concurrently?

Yes. The VIRTIO spec has a generation count to handle multi-field
accesses so that consistency can be ensured:
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004

> 
> > Again, I don't think it's a problem for existing devices in the VIRTIO
> > specification. But I'm not 100% sure and future devices might require
> > what I've described, so the VDUSE_DEV_SET_CONFIG interface could become
> > a problem.
> >
> 
> If so, maybe a new interface can be added at that time. The
> VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good
> way for failure handling.

I'm not aware of the details of why the current approach was necessary,
so I don't have any concrete suggestions. Sorry!

Stefan
Yongji Xie July 4, 2021, 9:49 a.m. UTC | #6
On Thu, Jul 1, 2021 at 9:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> > On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > > > +     {
> > > > > > +             int fd;
> > > > > > +             void *addr;
> > > > > > +             size_t size;
> > > > > > +             struct vduse_iotlb_entry entry;
> > > > > > +
> > > > > > +             entry.start = iova;
> > > > > > +             entry.last = iova + 1;
> > > > >
> > > > > Why +1?
> > > > >
> > > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > > buffer for the full iova range, if necessary.
> > > > >
> > > >
> > > > The function is used to translate iova to va. And the *len is not
> > > > specified by the caller. Instead, it's used to tell the caller the
> > > > length of the contiguous iova region from the specified iova. And the
> > > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > > overlapped iova region. So using iova + 1 should be enough here.
> > >
> > > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > > wonder why userspace needs to assign a value at all if it's always +1.
> > >
> >
> > If we need to get some iova regions in the specified range, we need
> > the entry.last field. For example, we can use [0, ULONG_MAX] to get
> > the first overlapped iova region which might be [4096, 8192]. But in
> > this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> > get the iova region including the specified iova.
>
> I see, thanks for explaining!
>
> > > > > > +             return addr + iova - entry.start;
> > > > > > +     }
> > > > > > +
> > > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > > >
> > > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > > works. There must be a way for userspace to report the device's
> > > > > supported feature bits to the kernel.
> > > > >
> > > >
> > > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > > device's supported feature bits when creating a new VDUSE device with
> > > > ioctl(VDUSE_CREATE_DEV).
> > >
> > > Can the VDUSE device influence feature bit negotiation? For example, if
> > > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > > does QEMU or the guest find out about this?
> > >
> >
> > There is a "features" field in struct vduse_dev_config which is used
> > to do feature negotiation.
>
> This approach is more restrictive than required by the VIRTIO
> specification:
>
>   "The device SHOULD accept any valid subset of features the driver
>   accepts, otherwise it MUST fail to set the FEATURES_OK device status
>   bit when the driver writes it."
>
>   https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002
>
> The spec allows a device to reject certain subsets of features. For
> example, if feature B depends on feature A and can only be enabled when
> feature A is also enabled.
>
> From your description I think VDUSE would accept feature B without
> feature A since the device implementation has no opportunity to fail
> negotiation with custom logic.
>

Yes, we discussed it [1] before. So I'd like to re-introduce
SET_STATUS messages so that the userspace can fail feature negotiation
during setting FEATURES_OK status bit.

[1]  https://lkml.org/lkml/2021/6/28/1587

> Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
> the device implementation full flexibility in which subsets of features
> to accept.
>
> This is a corner case. Many or maybe even all existing VIRTIO devices
> don't need this flexibility, but I want to point out this limitation in
> the VDUSE interface because it may cause issues in the future.
>
> > > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> > > > >
> > > > > Does this mean the contents of the configuration space are cached by
> > > > > VDUSE?
> > > >
> > > > Yes, but the kernel will also store the same contents.
> > > >
> > > > > The downside is that the userspace code cannot generate the
> > > > > contents on demand. Most devices doin't need to generate the contents
> > > > > on demand, so I think this is okay but I had expected a different
> > > > > interface:
> > > > >
> > > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > > >
> > > >
> > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > > will need lots of modification of virtio codes to support that. So to
> > > > make it simple, we choose this way:
> > > >
> > > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > > > > I think you can leave it the way it is, but I wanted to mention this in
> > > > > case someone thinks it's important to support generating the contents of
> > > > > the configuration space on demand.
> > > > >
> > > >
> > > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> > >
> > > If the contents of the configuration space change continuously, then the
> > > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> > > conditions. For example, imagine a device where the driver can read a
> > > timer from the configuration space. I think the VIRTIO device model
> > > allows that although I'm not aware of any devices that do something like
> > > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> > > called frequently to keep the timer value updated even though the guest
> > > driver probably isn't accessing it.
> > >
> >
> > OK, I get you now. Since the VIRTIO specification says "Device
> > configuration space is generally used for rarely-changing or
> > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > ioctl should not be called frequently.
>
> The spec uses MUST and other terms to define the precise requirements.
> Here the language (especially the word "generally") is weaker and means
> there may be exceptions.
>
> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> approach is reads that have side-effects. For example, imagine a field
> containing an error code if the device encounters a problem unrelated to
> a specific virtqueue request. Reading from this field resets the error
> code to 0, saving the driver an extra configuration space write access
> and possibly race conditions. It isn't possible to implement those
> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> makes me think that the interface does not allow full VIRTIO semantics.
>

Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

> > > What's worse is that there might be race conditions where other
> > > driver->device operations are supposed to update the configuration space
> > > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
> > > outdated copy.
> > >
> >
> > I'm not sure. Should the device and driver be able to access the same
> > fields concurrently?
>
> Yes. The VIRTIO spec has a generation count to handle multi-field
> accesses so that consistency can be ensured:
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004
>

I see.

Thanks,
Yongji
Jason Wang July 5, 2021, 3:36 a.m. UTC | #7
在 2021/7/4 下午5:49, Yongji Xie 写道:
>>> OK, I get you now. Since the VIRTIO specification says "Device
>>> configuration space is generally used for rarely-changing or
>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>> ioctl should not be called frequently.
>> The spec uses MUST and other terms to define the precise requirements.
>> Here the language (especially the word "generally") is weaker and means
>> there may be exceptions.
>>
>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>> approach is reads that have side-effects. For example, imagine a field
>> containing an error code if the device encounters a problem unrelated to
>> a specific virtqueue request. Reading from this field resets the error
>> code to 0, saving the driver an extra configuration space write access
>> and possibly race conditions. It isn't possible to implement those
>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>> makes me think that the interface does not allow full VIRTIO semantics.


Note that though you're correct, my understanding is that config space 
is not suitable for this kind of error propagating. And it would be very 
hard to implement such kind of semantic in some transports.  Virtqueue 
should be much better. As Yong Ji quoted, the config space is used for 
"rarely-changing or intialization-time parameters".


> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> handle the message failure, I'm going to add a return value to
> virtio_config_ops.get() and virtio_cread_* API so that the error can
> be propagated to the virtio device driver. Then the virtio-blk device
> driver can be modified to handle that.
>
> Jason and Stefan, what do you think of this way?


I'd like to stick to the current assumption thich get_config won't fail. 
That is to say,

1) maintain a config in the kernel, make sure the config space read can 
always succeed
2) introduce an ioctl for the vduse usersapce to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

Thanks


>
Stefan Hajnoczi July 5, 2021, 12:49 p.m. UTC | #8
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> 
> 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > configuration space is generally used for rarely-changing or
> > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > ioctl should not be called frequently.
> > > The spec uses MUST and other terms to define the precise requirements.
> > > Here the language (especially the word "generally") is weaker and means
> > > there may be exceptions.
> > > 
> > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > approach is reads that have side-effects. For example, imagine a field
> > > containing an error code if the device encounters a problem unrelated to
> > > a specific virtqueue request. Reading from this field resets the error
> > > code to 0, saving the driver an extra configuration space write access
> > > and possibly race conditions. It isn't possible to implement those
> > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > makes me think that the interface does not allow full VIRTIO semantics.
> 
> 
> Note that though you're correct, my understanding is that config space is
> not suitable for this kind of error propagating. And it would be very hard
> to implement such kind of semantic in some transports.  Virtqueue should be
> much better. As Yong Ji quoted, the config space is used for
> "rarely-changing or intialization-time parameters".
> 
> 
> > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > handle the message failure, I'm going to add a return value to
> > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > be propagated to the virtio device driver. Then the virtio-blk device
> > driver can be modified to handle that.
> > 
> > Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

> I'd like to stick to the current assumption thich get_config won't fail.
> That is to say,
> 
> 1) maintain a config in the kernel, make sure the config space read can
> always succeed
> 2) introduce an ioctl for the vduse usersapce to update the config space.
> 3) we can synchronize with the vduse userspace during set_config
> 
> Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Here are the vhost-user protocol messages:

  Virtio device config space
  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  +--------+------+-------+---------+
  | offset | size | flags | payload |
  +--------+------+-------+---------+

  :offset: a 32-bit offset of virtio device's configuration space

  :size: a 32-bit configuration space access size in bytes

  :flags: a 32-bit value:
    - 0: Vhost master messages used for writeable fields
    - 1: Vhost master messages used for live migration

  :payload: Size bytes array holding the contents of the virtio
            device's configuration space

  ...

  ``VHOST_USER_GET_CONFIG``
    :id: 24
    :equivalent ioctl: N/A
    :master payload: virtio device config space
    :slave payload: virtio device config space

    When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
    submitted by the vhost-user master to fetch the contents of the
    virtio device configuration space, vhost-user slave's payload size
    MUST match master's request, vhost-user slave uses zero length of
    payload to indicate an error to vhost-user master. The vhost-user
    master may cache the contents to avoid repeated
    ``VHOST_USER_GET_CONFIG`` calls.

  ``VHOST_USER_SET_CONFIG``
    :id: 25
    :equivalent ioctl: N/A
    :master payload: virtio device config space
    :slave payload: N/A

    When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
    submitted by the vhost-user master when the Guest changes the virtio
    device configuration space and also can be used for live migration
    on the destination host. The vhost-user slave must check the flags
    field, and slaves MUST NOT accept SET_CONFIG for read-only
    configuration space fields unless the live migration bit is set.

Stefan
Jason Wang July 6, 2021, 2:34 a.m. UTC | #9
在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>> 在 2021/7/4 下午5:49, Yongji Xie 写道:
>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>> configuration space is generally used for rarely-changing or
>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>> ioctl should not be called frequently.
>>>> The spec uses MUST and other terms to define the precise requirements.
>>>> Here the language (especially the word "generally") is weaker and means
>>>> there may be exceptions.
>>>>
>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>> approach is reads that have side-effects. For example, imagine a field
>>>> containing an error code if the device encounters a problem unrelated to
>>>> a specific virtqueue request. Reading from this field resets the error
>>>> code to 0, saving the driver an extra configuration space write access
>>>> and possibly race conditions. It isn't possible to implement those
>>>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>
>> Note that though you're correct, my understanding is that config space is
>> not suitable for this kind of error propagating. And it would be very hard
>> to implement such kind of semantic in some transports.  Virtqueue should be
>> much better. As Yong Ji quoted, the config space is used for
>> "rarely-changing or intialization-time parameters".
>>
>>
>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>> handle the message failure, I'm going to add a return value to
>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>> be propagated to the virtio device driver. Then the virtio-blk device
>>> driver can be modified to handle that.
>>>
>>> Jason and Stefan, what do you think of this way?
> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>
> The VIRTIO spec provides no way for the device to report errors from
> config space accesses.
>
> The QEMU virtio-pci implementation returns -1 from invalid
> virtio_config_read*() and silently discards virtio_config_write*()
> accesses.
>
> VDUSE can take the same approach with
> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>
>> I'd like to stick to the current assumption thich get_config won't fail.
>> That is to say,
>>
>> 1) maintain a config in the kernel, make sure the config space read can
>> always succeed
>> 2) introduce an ioctl for the vduse usersapce to update the config space.
>> 3) we can synchronize with the vduse userspace during set_config
>>
>> Does this work?
> I noticed that caching is also allowed by the vhost-user protocol
> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> know whether or not caching is in effect. The interface you outlined
> above requires caching.
>
> Is there a reason why the host kernel vDPA code needs to cache the
> configuration space?


Because:

1) Kernel can not wait forever in get_config(), this is the major 
difference with vhost-user.
2) Stick to the current assumption that virtio_cread() should always 
succeed.

Thanks


>
> Here are the vhost-user protocol messages:
>
>    Virtio device config space
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>    +--------+------+-------+---------+
>    | offset | size | flags | payload |
>    +--------+------+-------+---------+
>
>    :offset: a 32-bit offset of virtio device's configuration space
>
>    :size: a 32-bit configuration space access size in bytes
>
>    :flags: a 32-bit value:
>      - 0: Vhost master messages used for writeable fields
>      - 1: Vhost master messages used for live migration
>
>    :payload: Size bytes array holding the contents of the virtio
>              device's configuration space
>
>    ...
>
>    ``VHOST_USER_GET_CONFIG``
>      :id: 24
>      :equivalent ioctl: N/A
>      :master payload: virtio device config space
>      :slave payload: virtio device config space
>
>      When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
>      submitted by the vhost-user master to fetch the contents of the
>      virtio device configuration space, vhost-user slave's payload size
>      MUST match master's request, vhost-user slave uses zero length of
>      payload to indicate an error to vhost-user master. The vhost-user
>      master may cache the contents to avoid repeated
>      ``VHOST_USER_GET_CONFIG`` calls.
>
>    ``VHOST_USER_SET_CONFIG``
>      :id: 25
>      :equivalent ioctl: N/A
>      :master payload: virtio device config space
>      :slave payload: N/A
>
>      When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
>      submitted by the vhost-user master when the Guest changes the virtio
>      device configuration space and also can be used for live migration
>      on the destination host. The vhost-user slave must check the flags
>      field, and slaves MUST NOT accept SET_CONFIG for read-only
>      configuration space fields unless the live migration bit is set.
>
> Stefan
Yongji Xie July 6, 2021, 3:04 a.m. UTC | #10
On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> >
> > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > configuration space is generally used for rarely-changing or
> > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > ioctl should not be called frequently.
> > > > The spec uses MUST and other terms to define the precise requirements.
> > > > Here the language (especially the word "generally") is weaker and means
> > > > there may be exceptions.
> > > >
> > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > approach is reads that have side-effects. For example, imagine a field
> > > > containing an error code if the device encounters a problem unrelated to
> > > > a specific virtqueue request. Reading from this field resets the error
> > > > code to 0, saving the driver an extra configuration space write access
> > > > and possibly race conditions. It isn't possible to implement those
> > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > makes me think that the interface does not allow full VIRTIO semantics.
> >
> >
> > Note that though you're correct, my understanding is that config space is
> > not suitable for this kind of error propagating. And it would be very hard
> > to implement such kind of semantic in some transports.  Virtqueue should be
> > much better. As Yong Ji quoted, the config space is used for
> > "rarely-changing or intialization-time parameters".
> >
> >
> > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > handle the message failure, I'm going to add a return value to
> > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > be propagated to the virtio device driver. Then the virtio-blk device
> > > driver can be modified to handle that.
> > >
> > > Jason and Stefan, what do you think of this way?
>
> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>

We add a timeout and return error in case userspace never replies to
the message.

> The VIRTIO spec provides no way for the device to report errors from
> config space accesses.
>
> The QEMU virtio-pci implementation returns -1 from invalid
> virtio_config_read*() and silently discards virtio_config_write*()
> accesses.
>
> VDUSE can take the same approach with
> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>

I noticed that virtio_config_read*() only returns -1 when we access a
invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
when we access a valid field. Not sure if it's ok to silently ignore
this kind of error.

Thanks,
Yongji
Stefan Hajnoczi July 6, 2021, 10:14 a.m. UTC | #11
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> 
> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and means
> > > > > there may be exceptions.
> > > > > 
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > 
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > > 
> > > 
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > > 
> > > > Jason and Stefan, what do you think of this way?
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> > 
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> > 
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > 
> > > I'd like to stick to the current assumption thich get_config won't fail.
> > > That is to say,
> > > 
> > > 1) maintain a config in the kernel, make sure the config space read can
> > > always succeed
> > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > 3) we can synchronize with the vduse userspace during set_config
> > > 
> > > Does this work?
> > I noticed that caching is also allowed by the vhost-user protocol
> > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > know whether or not caching is in effect. The interface you outlined
> > above requires caching.
> > 
> > Is there a reason why the host kernel vDPA code needs to cache the
> > configuration space?
> 
> 
> Because:
> 
> 1) Kernel can not wait forever in get_config(), this is the major difference
> with vhost-user.

virtio_cread() can sleep:

  #define virtio_cread(vdev, structname, member, ptr)                     \
          do {                                                            \
                  typeof(((structname*)0)->member) virtio_cread_v;        \
                                                                          \
                  might_sleep();                                          \
                  ^^^^^^^^^^^^^^

Which code path cannot sleep?

> 2) Stick to the current assumption that virtio_cread() should always
> succeed.

That can be done by reading -1 (like QEMU does) when the read fails.

Stefan
Stefan Hajnoczi July 6, 2021, 10:22 a.m. UTC | #12
On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and means
> > > > > there may be exceptions.
> > > > >
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > >
> > >
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > >
> > >
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > >
> > > > Jason and Stefan, what do you think of this way?
> >
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> >
> 
> We add a timeout and return error in case userspace never replies to
> the message.
> 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> >
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> >
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> >
> 
> I noticed that virtio_config_read*() only returns -1 when we access a
> invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> when we access a valid field. Not sure if it's ok to silently ignore
> this kind of error.

That's a good point but it's a general VIRTIO issue. Any device
implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
VIRTIO specification needs to provide a way for the driver to detect
this.

If userspace violates the contract then VDUSE needs to mark the device
broken. QEMU's device emulation does something similar with the
vdev->broken flag.

The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
vDPA/VDUSE to indicate that the device is not operational and must be
reset.

The driver code may still process the -1 value read from the
configuration space. Hopefully this isn't a problem. There is currently
no VIRTIO interface besides DEVICE_NEEDS_RESET to indicate configuration
space access failures. On the other hand, drivers need to handle
malicious devices so they should be able to cope with the -1 value
anyway.

Stefan
Yongji Xie July 7, 2021, 9:09 a.m. UTC | #13
On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > >
> > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > ioctl should not be called frequently.
> > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > there may be exceptions.
> > > > > >
> > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > >
> > > >
> > > > Note that though you're correct, my understanding is that config space is
> > > > not suitable for this kind of error propagating. And it would be very hard
> > > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > > much better. As Yong Ji quoted, the config space is used for
> > > > "rarely-changing or intialization-time parameters".
> > > >
> > > >
> > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > handle the message failure, I'm going to add a return value to
> > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > driver can be modified to handle that.
> > > > >
> > > > > Jason and Stefan, what do you think of this way?
> > >
> > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > >
> >
> > We add a timeout and return error in case userspace never replies to
> > the message.
> >
> > > The VIRTIO spec provides no way for the device to report errors from
> > > config space accesses.
> > >
> > > The QEMU virtio-pci implementation returns -1 from invalid
> > > virtio_config_read*() and silently discards virtio_config_write*()
> > > accesses.
> > >
> > > VDUSE can take the same approach with
> > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > >
> >
> > I noticed that virtio_config_read*() only returns -1 when we access a
> > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > when we access a valid field. Not sure if it's ok to silently ignore
> > this kind of error.
>
> That's a good point but it's a general VIRTIO issue. Any device
> implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> VIRTIO specification needs to provide a way for the driver to detect
> this.
>
> If userspace violates the contract then VDUSE needs to mark the device
> broken. QEMU's device emulation does something similar with the
> vdev->broken flag.
>
> The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> vDPA/VDUSE to indicate that the device is not operational and must be
> reset.
>

It might be a solution. But DEVICE_NEEDS_RESET  is not implemented
currently. So I'm thinking whether it's ok to add a check of
DEVICE_NEEDS_RESET status bit in probe function of virtio device
driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail
device initailization when configuration space access failed.

Thanks,
Yongji
Jason Wang July 7, 2021, 9:24 a.m. UTC | #14
在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
>> 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
>>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
>>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
>>>>>> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
>>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>>>>>>>> 在 2021/7/4 下午5:49, Yongji Xie 写道:
>>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>>>>>>>> configuration space is generally used for rarely-changing or
>>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>> ioctl should not be called frequently.
>>>>>>>>>> The spec uses MUST and other terms to define the precise requirements.
>>>>>>>>>> Here the language (especially the word "generally") is weaker and means
>>>>>>>>>> there may be exceptions.
>>>>>>>>>>
>>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>>>>>>>> approach is reads that have side-effects. For example, imagine a field
>>>>>>>>>> containing an error code if the device encounters a problem unrelated to
>>>>>>>>>> a specific virtqueue request. Reading from this field resets the error
>>>>>>>>>> code to 0, saving the driver an extra configuration space write access
>>>>>>>>>> and possibly race conditions. It isn't possible to implement those
>>>>>>>>>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>>>>>>> Note that though you're correct, my understanding is that config space is
>>>>>>>> not suitable for this kind of error propagating. And it would be very hard
>>>>>>>> to implement such kind of semantic in some transports.  Virtqueue should be
>>>>>>>> much better. As Yong Ji quoted, the config space is used for
>>>>>>>> "rarely-changing or intialization-time parameters".
>>>>>>>>
>>>>>>>>
>>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>>>>>>>> handle the message failure, I'm going to add a return value to
>>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device
>>>>>>>>> driver can be modified to handle that.
>>>>>>>>>
>>>>>>>>> Jason and Stefan, what do you think of this way?
>>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>>>>>>>
>>>>>>> The VIRTIO spec provides no way for the device to report errors from
>>>>>>> config space accesses.
>>>>>>>
>>>>>>> The QEMU virtio-pci implementation returns -1 from invalid
>>>>>>> virtio_config_read*() and silently discards virtio_config_write*()
>>>>>>> accesses.
>>>>>>>
>>>>>>> VDUSE can take the same approach with
>>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>>>>>>>
>>>>>>>> I'd like to stick to the current assumption thich get_config won't fail.
>>>>>>>> That is to say,
>>>>>>>>
>>>>>>>> 1) maintain a config in the kernel, make sure the config space read can
>>>>>>>> always succeed
>>>>>>>> 2) introduce an ioctl for the vduse usersapce to update the config space.
>>>>>>>> 3) we can synchronize with the vduse userspace during set_config
>>>>>>>>
>>>>>>>> Does this work?
>>>>>>> I noticed that caching is also allowed by the vhost-user protocol
>>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
>>>>>>> know whether or not caching is in effect. The interface you outlined
>>>>>>> above requires caching.
>>>>>>>
>>>>>>> Is there a reason why the host kernel vDPA code needs to cache the
>>>>>>> configuration space?
>>>>>> Because:
>>>>>>
>>>>>> 1) Kernel can not wait forever in get_config(), this is the major difference
>>>>>> with vhost-user.
>>>>> virtio_cread() can sleep:
>>>>>
>>>>>     #define virtio_cread(vdev, structname, member, ptr)                     \
>>>>>             do {                                                            \
>>>>>                     typeof(((structname*)0)->member) virtio_cread_v;        \
>>>>>                                                                             \
>>>>>                     might_sleep();                                          \
>>>>>                     ^^^^^^^^^^^^^^
>>>>>
>>>>> Which code path cannot sleep?
>>>> Well, it can sleep but it can't sleep forever. For VDUSE, a
>>>> buggy/malicious userspace may refuse to respond to the get_config.
>>>>
>>>> It looks to me the ideal case, with the current virtio spec, for VDUSE is to
>>>>
>>>> 1) maintain the device and its state in the kernel, userspace may sync
>>>> with the kernel device via ioctls
>>>> 2) offload the datapath (virtqueue) to the userspace
>>>>
>>>> This seems more robust and safe than simply relaying everything to
>>>> userspace and waiting for its response.
>>>>
>>>> And we know for sure this model can work, an example is TUN/TAP:
>>>> netdevice is abstracted in the kernel and datapath is done via
>>>> sendmsg()/recvmsg().
>>>>
>>>> Maintaining the config in the kernel follows this model and it can
>>>> simplify the device generation implementation.
>>>>
>>>> For config space write, it requires more thought but fortunately it's
>>>> not commonly used. So VDUSE can choose to filter out the
>>>> device/features that depends on the config write.
>>> This is the problem. There are other messages like SET_FEATURES where I
>>> guess we'll face the same challenge.
>>
>> Probably not, userspace device can tell the kernel about the device_features
>> and mandated_features during creation, and the feature negotiation could be
>> done purely in the kernel without bothering the userspace.


(For some reason I drop the list accidentally, adding them back, sorry)


> Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> interface where the driver waits for the device.


It depends on how we define "synchronous" here. If I understand 
correctly, the spec doesn't expect there will be any kind of failure for 
the operation of set_status itself.

Instead, anytime it want any synchronization, it should be done via 
get_status():

1) re-read device status to make sure FEATURES_OK is set during feature 
negotiation
2) re-read device status to be 0 to make sure the device has finish the 
reset


>
> VDUSE currently doesn't wait for the device emulation process to handle
> this message (no reply is needed) but I think this is a mistake because
> VDUSE is not following the VIRTIO device model.


With the trick that is done for FEATURES_OK above, I think we don't need 
to wait for the reply.

If userspace takes too long to respond, it can be detected since 
get_status() doesn't return the expected value for long time.

And for the case that needs a timeout, we probably can use NEEDS_RESET.


>
> I strongly suggest designing the VDUSE interface to match the VIRTIO
> device model (or at least the vDPA interface).


I fully agree with you and that is what we want to achieve in this series.


> Defining a custom
> interface for VDUSE avoids some implementation complexity and makes it
> easier to deal with untrusted userspace, but it's impossible to
> implement certain VIRTIO features or devices. It also fragments VIRTIO
> more than necessary; we have a standard, let's stick to it.


Yes.


>
>>> I agree that caching the contents of configuration space in the kernel
>>> helps, but if there are other VDUSE messages with the same problem then
>>> an attacker will exploit them instead.
>>>
>>> I think a systematic solution is needed. It would be necessary to
>>> enumerate the virtio_vdpa and vhost_vdpa cases separately to figure out
>>> where VDUSE messages are synchronous/time-sensitive.
>>
>> This is the case of reset and needs more thought. We should stick a
>> consistent uAPI for the userspace.
>>
>> For vhost-vDPA, it needs synchronzied with the userspace and we can wait for
>> ever.
> The VMM should still be able to handle signals when a vhost_vdpa ioctl
> is waiting for a reply from the VDUSE userspace process. Or if that's
> not possible then there needs to be a way to force disconnection from
> VDUSE so the VMM can be killed.


Note that VDUSE works under vDPA bus, so vhost should be transport to VDUSE.

But we can detect this via whether or not the bounce buffer is used.

Thanks


>
> Stefan
Stefan Hajnoczi July 7, 2021, 3:54 p.m. UTC | #15
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that config space is
> > > > > > > > > not suitable for this kind of error propagating. And it would be very hard
> > > > > > > > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > 
> > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > > 
> > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > config space accesses.
> > > > > > > > 
> > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > accesses.
> > > > > > > > 
> > > > > > > > VDUSE can take the same approach with
> > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > 
> > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail.
> > > > > > > > > That is to say,
> > > > > > > > > 
> > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > always succeed
> > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > > 
> > > > > > > > > Does this work?
> > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > above requires caching.
> > > > > > > > 
> > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > configuration space?
> > > > > > > Because:
> > > > > > > 
> > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference
> > > > > > > with vhost-user.
> > > > > > virtio_cread() can sleep:
> > > > > > 
> > > > > >     #define virtio_cread(vdev, structname, member, ptr)                     \
> > > > > >             do {                                                            \
> > > > > >                     typeof(((structname*)0)->member) virtio_cread_v;        \
> > > > > >                                                                             \
> > > > > >                     might_sleep();                                          \
> > > > > >                     ^^^^^^^^^^^^^^
> > > > > > 
> > > > > > Which code path cannot sleep?
> > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > > 
> > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to
> > > > > 
> > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > with the kernel device via ioctls
> > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > > 
> > > > > This seems more robust and safe than simply relaying everything to
> > > > > userspace and waiting for its response.
> > > > > 
> > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > sendmsg()/recvmsg().
> > > > > 
> > > > > Maintaining the config in the kernel follows this model and it can
> > > > > simplify the device generation implementation.
> > > > > 
> > > > > For config space write, it requires more thought but fortunately it's
> > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > device/features that depends on the config write.
> > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > guess we'll face the same challenge.
> > > 
> > > Probably not, userspace device can tell the kernel about the device_features
> > > and mandated_features during creation, and the feature negotiation could be
> > > done purely in the kernel without bothering the userspace.
> 
> 
> (For some reason I drop the list accidentally, adding them back, sorry)
> 
> 
> > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > interface where the driver waits for the device.
> 
> 
> It depends on how we define "synchronous" here. If I understand correctly,
> the spec doesn't expect there will be any kind of failure for the operation
> of set_status itself.
> 
> Instead, anytime it want any synchronization, it should be done via
> get_status():
> 
> 1) re-read device status to make sure FEATURES_OK is set during feature
> negotiation
> 2) re-read device status to be 0 to make sure the device has finish the
> reset
> 
> 
> > 
> > VDUSE currently doesn't wait for the device emulation process to handle
> > this message (no reply is needed) but I think this is a mistake because
> > VDUSE is not following the VIRTIO device model.
> 
> 
> With the trick that is done for FEATURES_OK above, I think we don't need to
> wait for the reply.
> 
> If userspace takes too long to respond, it can be detected since
> get_status() doesn't return the expected value for long time.
> 
> And for the case that needs a timeout, we probably can use NEEDS_RESET.

I think you're right. get_status is the synchronization point, not
set_status.

Currently there is no VDUSE GET_STATUS message. The
VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
that the device emulation program can participate in emulating the
Device Status field. This change could affect VDUSE's VIRTIO feature
interface since the device emulation program can reject features by not
setting FEATURES_OK.

Stefan
Jason Wang July 8, 2021, 4:17 a.m. UTC | #16
在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
>> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
>>> On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
>>>> 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
>>>>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
>>>>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
>>>>>>>> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
>>>>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>>>>>>>>>> 在 2021/7/4 下午5:49, Yongji Xie 写道:
>>>>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>>>>>>>>>> configuration space is generally used for rarely-changing or
>>>>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>>>> ioctl should not be called frequently.
>>>>>>>>>>>> The spec uses MUST and other terms to define the precise requirements.
>>>>>>>>>>>> Here the language (especially the word "generally") is weaker and means
>>>>>>>>>>>> there may be exceptions.
>>>>>>>>>>>>
>>>>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>>> approach is reads that have side-effects. For example, imagine a field
>>>>>>>>>>>> containing an error code if the device encounters a problem unrelated to
>>>>>>>>>>>> a specific virtqueue request. Reading from this field resets the error
>>>>>>>>>>>> code to 0, saving the driver an extra configuration space write access
>>>>>>>>>>>> and possibly race conditions. It isn't possible to implement those
>>>>>>>>>>>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>>>>>>>>> Note that though you're correct, my understanding is that config space is
>>>>>>>>>> not suitable for this kind of error propagating. And it would be very hard
>>>>>>>>>> to implement such kind of semantic in some transports.  Virtqueue should be
>>>>>>>>>> much better. As Yong Ji quoted, the config space is used for
>>>>>>>>>> "rarely-changing or intialization-time parameters".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>>>>>>>>>> handle the message failure, I'm going to add a return value to
>>>>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>>>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device
>>>>>>>>>>> driver can be modified to handle that.
>>>>>>>>>>>
>>>>>>>>>>> Jason and Stefan, what do you think of this way?
>>>>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>>>>>>>>>
>>>>>>>>> The VIRTIO spec provides no way for the device to report errors from
>>>>>>>>> config space accesses.
>>>>>>>>>
>>>>>>>>> The QEMU virtio-pci implementation returns -1 from invalid
>>>>>>>>> virtio_config_read*() and silently discards virtio_config_write*()
>>>>>>>>> accesses.
>>>>>>>>>
>>>>>>>>> VDUSE can take the same approach with
>>>>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>>>>>>>>>
>>>>>>>>>> I'd like to stick to the current assumption thich get_config won't fail.
>>>>>>>>>> That is to say,
>>>>>>>>>>
>>>>>>>>>> 1) maintain a config in the kernel, make sure the config space read can
>>>>>>>>>> always succeed
>>>>>>>>>> 2) introduce an ioctl for the vduse usersapce to update the config space.
>>>>>>>>>> 3) we can synchronize with the vduse userspace during set_config
>>>>>>>>>>
>>>>>>>>>> Does this work?
>>>>>>>>> I noticed that caching is also allowed by the vhost-user protocol
>>>>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
>>>>>>>>> know whether or not caching is in effect. The interface you outlined
>>>>>>>>> above requires caching.
>>>>>>>>>
>>>>>>>>> Is there a reason why the host kernel vDPA code needs to cache the
>>>>>>>>> configuration space?
>>>>>>>> Because:
>>>>>>>>
>>>>>>>> 1) Kernel can not wait forever in get_config(), this is the major difference
>>>>>>>> with vhost-user.
>>>>>>> virtio_cread() can sleep:
>>>>>>>
>>>>>>>      #define virtio_cread(vdev, structname, member, ptr)                     \
>>>>>>>              do {                                                            \
>>>>>>>                      typeof(((structname*)0)->member) virtio_cread_v;        \
>>>>>>>                                                                              \
>>>>>>>                      might_sleep();                                          \
>>>>>>>                      ^^^^^^^^^^^^^^
>>>>>>>
>>>>>>> Which code path cannot sleep?
>>>>>> Well, it can sleep but it can't sleep forever. For VDUSE, a
>>>>>> buggy/malicious userspace may refuse to respond to the get_config.
>>>>>>
>>>>>> It looks to me the ideal case, with the current virtio spec, for VDUSE is to
>>>>>>
>>>>>> 1) maintain the device and its state in the kernel, userspace may sync
>>>>>> with the kernel device via ioctls
>>>>>> 2) offload the datapath (virtqueue) to the userspace
>>>>>>
>>>>>> This seems more robust and safe than simply relaying everything to
>>>>>> userspace and waiting for its response.
>>>>>>
>>>>>> And we know for sure this model can work, an example is TUN/TAP:
>>>>>> netdevice is abstracted in the kernel and datapath is done via
>>>>>> sendmsg()/recvmsg().
>>>>>>
>>>>>> Maintaining the config in the kernel follows this model and it can
>>>>>> simplify the device generation implementation.
>>>>>>
>>>>>> For config space write, it requires more thought but fortunately it's
>>>>>> not commonly used. So VDUSE can choose to filter out the
>>>>>> device/features that depends on the config write.
>>>>> This is the problem. There are other messages like SET_FEATURES where I
>>>>> guess we'll face the same challenge.
>>>> Probably not, userspace device can tell the kernel about the device_features
>>>> and mandated_features during creation, and the feature negotiation could be
>>>> done purely in the kernel without bothering the userspace.
>>
>> (For some reason I drop the list accidentally, adding them back, sorry)
>>
>>
>>> Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
>>> interface where the driver waits for the device.
>>
>> It depends on how we define "synchronous" here. If I understand correctly,
>> the spec doesn't expect there will be any kind of failure for the operation
>> of set_status itself.
>>
>> Instead, anytime it want any synchronization, it should be done via
>> get_status():
>>
>> 1) re-read device status to make sure FEATURES_OK is set during feature
>> negotiation
>> 2) re-read device status to be 0 to make sure the device has finish the
>> reset
>>
>>
>>> VDUSE currently doesn't wait for the device emulation process to handle
>>> this message (no reply is needed) but I think this is a mistake because
>>> VDUSE is not following the VIRTIO device model.
>>
>> With the trick that is done for FEATURES_OK above, I think we don't need to
>> wait for the reply.
>>
>> If userspace takes too long to respond, it can be detected since
>> get_status() doesn't return the expected value for long time.
>>
>> And for the case that needs a timeout, we probably can use NEEDS_RESET.
> I think you're right. get_status is the synchronization point, not
> set_status.
>
> Currently there is no VDUSE GET_STATUS message. The
> VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
> that the device emulation program can participate in emulating the
> Device Status field.


I'm not sure I get this, but it is what has been done?

+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+    struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+    bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
+
+    dev->status = status;
+
+    if (dev->started == started)
+        return;
+
+    dev->started = started;
+    if (dev->started) {
+        vduse_dev_start_dataplane(dev);
+    } else {
+        vduse_dev_reset(dev);
+        vduse_dev_stop_dataplane(dev);
+    }
+}


But the looks not correct:

1) !DRIVER_OK doesn't means a reset?
2) Need to deal with FEATURES_OK

Thanks


>   This change could affect VDUSE's VIRTIO feature
> interface since the device emulation program can reject features by not
> setting FEATURES_OK.
>
> Stefan
Stefan Hajnoczi July 8, 2021, 9:06 a.m. UTC | #17
On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > > > Note that though you're correct, my understanding is that config space is
> > > > > > > > > > > not suitable for this kind of error propagating. And it would be very hard
> > > > > > > > > > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > > > 
> > > > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > > > > 
> > > > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > > > config space accesses.
> > > > > > > > > > 
> > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > > > accesses.
> > > > > > > > > > 
> > > > > > > > > > VDUSE can take the same approach with
> > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > > > 
> > > > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail.
> > > > > > > > > > > That is to say,
> > > > > > > > > > > 
> > > > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > > > always succeed
> > > > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > > > > 
> > > > > > > > > > > Does this work?
> > > > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > > > above requires caching.
> > > > > > > > > > 
> > > > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > > > configuration space?
> > > > > > > > > Because:
> > > > > > > > > 
> > > > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference
> > > > > > > > > with vhost-user.
> > > > > > > > virtio_cread() can sleep:
> > > > > > > > 
> > > > > > > >      #define virtio_cread(vdev, structname, member, ptr)                     \
> > > > > > > >              do {                                                            \
> > > > > > > >                      typeof(((structname*)0)->member) virtio_cread_v;        \
> > > > > > > >                                                                              \
> > > > > > > >                      might_sleep();                                          \
> > > > > > > >                      ^^^^^^^^^^^^^^
> > > > > > > > 
> > > > > > > > Which code path cannot sleep?
> > > > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > > > > 
> > > > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to
> > > > > > > 
> > > > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > > > with the kernel device via ioctls
> > > > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > > > > 
> > > > > > > This seems more robust and safe than simply relaying everything to
> > > > > > > userspace and waiting for its response.
> > > > > > > 
> > > > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > > > sendmsg()/recvmsg().
> > > > > > > 
> > > > > > > Maintaining the config in the kernel follows this model and it can
> > > > > > > simplify the device generation implementation.
> > > > > > > 
> > > > > > > For config space write, it requires more thought but fortunately it's
> > > > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > > > device/features that depends on the config write.
> > > > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > > > guess we'll face the same challenge.
> > > > > Probably not, userspace device can tell the kernel about the device_features
> > > > > and mandated_features during creation, and the feature negotiation could be
> > > > > done purely in the kernel without bothering the userspace.
> > > 
> > > (For some reason I drop the list accidentally, adding them back, sorry)
> > > 
> > > 
> > > > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > > > interface where the driver waits for the device.
> > > 
> > > It depends on how we define "synchronous" here. If I understand correctly,
> > > the spec doesn't expect there will be any kind of failure for the operation
> > > of set_status itself.
> > > 
> > > Instead, anytime it want any synchronization, it should be done via
> > > get_status():
> > > 
> > > 1) re-read device status to make sure FEATURES_OK is set during feature
> > > negotiation
> > > 2) re-read device status to be 0 to make sure the device has finish the
> > > reset
> > > 
> > > 
> > > > VDUSE currently doesn't wait for the device emulation process to handle
> > > > this message (no reply is needed) but I think this is a mistake because
> > > > VDUSE is not following the VIRTIO device model.
> > > 
> > > With the trick that is done for FEATURES_OK above, I think we don't need to
> > > wait for the reply.
> > > 
> > > If userspace takes too long to respond, it can be detected since
> > > get_status() doesn't return the expected value for long time.
> > > 
> > > And for the case that needs a timeout, we probably can use NEEDS_RESET.
> > I think you're right. get_status is the synchronization point, not
> > set_status.
> > 
> > Currently there is no VDUSE GET_STATUS message. The
> > VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
> > that the device emulation program can participate in emulating the
> > Device Status field.
> 
> 
> I'm not sure I get this, but it is what has been done?
> 
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +    struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +    bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +    dev->status = status;
> +
> +    if (dev->started == started)
> +        return;
> +
> +    dev->started = started;
> +    if (dev->started) {
> +        vduse_dev_start_dataplane(dev);
> +    } else {
> +        vduse_dev_reset(dev);
> +        vduse_dev_stop_dataplane(dev);
> +    }
> +}
> 
> 
> But the looks not correct:
> 
> 1) !DRIVER_OK doesn't means a reset?
> 2) Need to deal with FEATURES_OK

I'm not sure if this reply was to me or to Yongji Xie?

Currently vduse_vdpa_set_status() does not allow the device emulation
program to participate fully in Device Status field changes. It hides
the status bits and only sends VDUSE_START/STOP_DATAPLANE.

I suggest having GET_STATUS/SET_STATUS messages instead, allowing the
device emulation program to handle these parts of the VIRTIO device
model (e.g. rejecting combinations of features that are mutually
exclusive).

Stefan
Stefan Hajnoczi July 8, 2021, 9:07 a.m. UTC | #18
On Wed, Jul 07, 2021 at 05:09:13PM +0800, Yongji Xie wrote:
> On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > ioctl should not be called frequently.
> > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > there may be exceptions.
> > > > > > >
> > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > >
> > > > >
> > > > > Note that though you're correct, my understanding is that config space is
> > > > > not suitable for this kind of error propagating. And it would be very hard
> > > > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > "rarely-changing or intialization-time parameters".
> > > > >
> > > > >
> > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > handle the message failure, I'm going to add a return value to
> > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > driver can be modified to handle that.
> > > > > >
> > > > > > Jason and Stefan, what do you think of this way?
> > > >
> > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > >
> > >
> > > We add a timeout and return error in case userspace never replies to
> > > the message.
> > >
> > > > The VIRTIO spec provides no way for the device to report errors from
> > > > config space accesses.
> > > >
> > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > accesses.
> > > >
> > > > VDUSE can take the same approach with
> > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > >
> > >
> > > I noticed that virtio_config_read*() only returns -1 when we access a
> > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > > when we access a valid field. Not sure if it's ok to silently ignore
> > > this kind of error.
> >
> > That's a good point but it's a general VIRTIO issue. Any device
> > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> > VIRTIO specification needs to provide a way for the driver to detect
> > this.
> >
> > If userspace violates the contract then VDUSE needs to mark the device
> > broken. QEMU's device emulation does something similar with the
> > vdev->broken flag.
> >
> > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> > vDPA/VDUSE to indicate that the device is not operational and must be
> > reset.
> >
> 
> It might be a solution. But DEVICE_NEEDS_RESET  is not implemented
> currently. So I'm thinking whether it's ok to add a check of
> DEVICE_NEEDS_RESET status bit in probe function of virtio device
> driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail
> device initailization when configuration space access failed.

Okay.

Stefan
Yongji Xie July 8, 2021, 12:35 p.m. UTC | #19
On Thu, Jul 8, 2021 at 5:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> >
> > 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> > > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > > > > Note that though you're correct, my understanding is that config space is
> > > > > > > > > > > > not suitable for this kind of error propagating. And it would be very hard
> > > > > > > > > > > > to implement such kind of semantic in some transports.  Virtqueue should be
> > > > > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > > > > >
> > > > > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > > > > config space accesses.
> > > > > > > > > > >
> > > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > > > > accesses.
> > > > > > > > > > >
> > > > > > > > > > > VDUSE can take the same approach with
> > > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > > > >
> > > > > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail.
> > > > > > > > > > > > That is to say,
> > > > > > > > > > > >
> > > > > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > > > > always succeed
> > > > > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > > > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > > > > >
> > > > > > > > > > > > Does this work?
> > > > > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > > > > above requires caching.
> > > > > > > > > > >
> > > > > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > > > > configuration space?
> > > > > > > > > > Because:
> > > > > > > > > >
> > > > > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference
> > > > > > > > > > with vhost-user.
> > > > > > > > > virtio_cread() can sleep:
> > > > > > > > >
> > > > > > > > >      #define virtio_cread(vdev, structname, member, ptr)                     \
> > > > > > > > >              do {                                                            \
> > > > > > > > >                      typeof(((structname*)0)->member) virtio_cread_v;        \
> > > > > > > > >                                                                              \
> > > > > > > > >                      might_sleep();                                          \
> > > > > > > > >                      ^^^^^^^^^^^^^^
> > > > > > > > >
> > > > > > > > > Which code path cannot sleep?
> > > > > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > > > > >
> > > > > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to
> > > > > > > >
> > > > > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > > > > with the kernel device via ioctls
> > > > > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > > > > >
> > > > > > > > This seems more robust and safe than simply relaying everything to
> > > > > > > > userspace and waiting for its response.
> > > > > > > >
> > > > > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > > > > sendmsg()/recvmsg().
> > > > > > > >
> > > > > > > > Maintaining the config in the kernel follows this model and it can
> > > > > > > > simplify the device generation implementation.
> > > > > > > >
> > > > > > > > For config space write, it requires more thought but fortunately it's
> > > > > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > > > > device/features that depends on the config write.
> > > > > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > > > > guess we'll face the same challenge.
> > > > > > Probably not, userspace device can tell the kernel about the device_features
> > > > > > and mandated_features during creation, and the feature negotiation could be
> > > > > > done purely in the kernel without bothering the userspace.
> > > >
> > > > (For some reason I drop the list accidentally, adding them back, sorry)
> > > >
> > > >
> > > > > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > > > > interface where the driver waits for the device.
> > > >
> > > > It depends on how we define "synchronous" here. If I understand correctly,
> > > > the spec doesn't expect there will be any kind of failure for the operation
> > > > of set_status itself.
> > > >
> > > > Instead, anytime it want any synchronization, it should be done via
> > > > get_status():
> > > >
> > > > 1) re-read device status to make sure FEATURES_OK is set during feature
> > > > negotiation
> > > > 2) re-read device status to be 0 to make sure the device has finish the
> > > > reset
> > > >
> > > >
> > > > > VDUSE currently doesn't wait for the device emulation process to handle
> > > > > this message (no reply is needed) but I think this is a mistake because
> > > > > VDUSE is not following the VIRTIO device model.
> > > >
> > > > With the trick that is done for FEATURES_OK above, I think we don't need to
> > > > wait for the reply.
> > > >
> > > > If userspace takes too long to respond, it can be detected since
> > > > get_status() doesn't return the expected value for long time.
> > > >
> > > > And for the case that needs a timeout, we probably can use NEEDS_RESET.
> > > I think you're right. get_status is the synchronization point, not
> > > set_status.
> > >
> > > Currently there is no VDUSE GET_STATUS message. The
> > > VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
> > > that the device emulation program can participate in emulating the
> > > Device Status field.
> >
> >
> > I'm not sure I get this, but it is what has been done?
> >
> > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> > +{
> > +    struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > +    bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> > +
> > +    dev->status = status;
> > +
> > +    if (dev->started == started)
> > +        return;
> > +
> > +    dev->started = started;
> > +    if (dev->started) {
> > +        vduse_dev_start_dataplane(dev);
> > +    } else {
> > +        vduse_dev_reset(dev);
> > +        vduse_dev_stop_dataplane(dev);
> > +    }
> > +}
> >
> >
> > But the looks not correct:
> >
> > 1) !DRIVER_OK doesn't means a reset?
> > 2) Need to deal with FEATURES_OK
>
> I'm not sure if this reply was to me or to Yongji Xie?
>
> Currently vduse_vdpa_set_status() does not allow the device emulation
> program to participate fully in Device Status field changes. It hides
> the status bits and only sends VDUSE_START/STOP_DATAPLANE.
>
> I suggest having GET_STATUS/SET_STATUS messages instead, allowing the
> device emulation program to handle these parts of the VIRTIO device
> model (e.g. rejecting combinations of features that are mutually
> exclusive).
>

Yes, I will do this in the next version.

Thanks,
Yongi
diff mbox series

Patch

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@  place where this information is gathered.
    iommu
    media/index
    sysfs-platform_profile
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
new file mode 100644
index 000000000000..2f9cd1a4e530
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,222 @@ 
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace. And
+to make it simple, the emulated vDPA device's control path is handled in
+the kernel and only the data path is implemented in the userspace.
+
+Note that only virtio block device is supported by VDUSE framework now,
+which can reduce security risks when the userspace process that implements
+the data path is run by an unprivileged user. The Support for other device
+types can be added after the security issue is clarified or fixed in the future.
+
+Start/Stop VDUSE devices
+------------------------
+
+VDUSE devices are started as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.
+
+VDUSE devices are stopped as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control
+
+The netlink messages metioned above can be sent via vdpa tool in iproute2
+or use the below sample codes:
+
+.. code-block:: c
+
+	static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+	{
+		struct nl_sock *nlsock;
+		struct nl_msg *msg;
+		int famid;
+
+		nlsock = nl_socket_alloc();
+		if (!nlsock)
+			return -ENOMEM;
+
+		if (genl_connect(nlsock))
+			goto free_sock;
+
+		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+		if (famid < 0)
+			goto close_sock;
+
+		msg = nlmsg_alloc();
+		if (!msg)
+			goto close_sock;
+
+		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
+			goto nla_put_failure;
+
+		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+		if (cmd == VDPA_CMD_DEV_NEW)
+			NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
+
+		if (nl_send_sync(nlsock, msg))
+			goto close_sock;
+
+		nl_close(nlsock);
+		nl_socket_free(nlsock);
+
+		return 0;
+	nla_put_failure:
+		nlmsg_free(msg);
+	close_sock:
+		nl_close(nlsock);
+	free_sock:
+		nl_socket_free(nlsock);
+		return -1;
+	}
+
+How VDUSE works
+---------------
+
+Since the emuldated vDPA device's control path is handled in the kernel,
+a message-based communication protocol and few types of control messages
+are introduced by VDUSE framework to make userspace be aware of the data
+path related changes:
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
+
+- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
+
+- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
+those control messages from/to VDUSE kernel module as follows:
+
+.. code-block:: c
+
+	static int vduse_message_handler(int dev_fd)
+	{
+		int len;
+		struct vduse_dev_request req;
+		struct vduse_dev_response resp;
+
+		len = read(dev_fd, &req, sizeof(req));
+		if (len != sizeof(req))
+			return -1;
+
+		resp.request_id = req.request_id;
+
+		switch (req.type) {
+
+		/* handle different types of message */
+
+		}
+
+		if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
+			return 0;
+
+		len = write(dev_fd, &resp, sizeof(resp));
+		if (len != sizeof(resp))
+			return -1;
+
+		return 0;
+	}
+
+After VDUSE_START_DATAPLANE messages is received, userspace should start the
+dataplane processing with the help of some ioctls on /dev/vduse/$NAME:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
+  Userspace can access this iova region by passing fd and corresponding size, offset,
+  perm to mmap(). For example:
+
+.. code-block:: c
+
+	static int perm_to_prot(uint8_t perm)
+	{
+		int prot = 0;
+
+		switch (perm) {
+		case VDUSE_ACCESS_WO:
+			prot |= PROT_WRITE;
+			break;
+		case VDUSE_ACCESS_RO:
+			prot |= PROT_READ;
+			break;
+		case VDUSE_ACCESS_RW:
+			prot |= PROT_READ | PROT_WRITE;
+			break;
+		}
+
+		return prot;
+	}
+
+	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
+	{
+		int fd;
+		void *addr;
+		size_t size;
+		struct vduse_iotlb_entry entry;
+
+		entry.start = iova;
+		entry.last = iova + 1;
+		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
+		if (fd < 0)
+			return NULL;
+
+		size = entry.last - entry.start + 1;
+		*len = entry.last - iova + 1;
+		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
+			    fd, entry.offset);
+		close(fd);
+		if (addr == MAP_FAILED)
+			return NULL;
+
+		/* do something to cache this iova region */
+
+		return addr + iova - entry.start;
+	}
+
+- VDUSE_DEV_GET_FEATURES: Get the negotiated features
+
+- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
+
+- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE kernel module to notify userspace to consume the vring.
+
+- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
+
+MMU-based IOMMU Driver
+----------------------
+
+VDUSE framework implements an MMU-based on-chip IOMMU driver to support
+mapping the kernel DMA buffer into the userspace iova region dynamically.
+This is mainly designed for virtio-vdpa case (kernel virtio drivers).
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the DMA buffer in kernel.
+
+And to avoid security issue, a bounce-buffering mechanism is introduced to
+prevent userspace accessing the original buffer directly which may contain other
+kernel data. During the mapping, unmapping, the driver will copy the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. And the bounce-buffer addresses will be mapped into the user address
+space instead of the original one.