Message ID | 20210615141331.407-11-xieyongji@bytedance.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Introduce VDUSE - vDPA Device in Userspace | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > VDUSE (vDPA Device in Userspace) is a framework to support > implementing software-emulated vDPA devices in userspace. This > document is intended to clarify the VDUSE design and usage. > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com> > --- > Documentation/userspace-api/index.rst | 1 + > Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++ > 2 files changed, 223 insertions(+) > create mode 100644 Documentation/userspace-api/vduse.rst > > diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst > index 0b5eefed027e..c432be070f67 100644 > --- a/Documentation/userspace-api/index.rst > +++ b/Documentation/userspace-api/index.rst > @@ -27,6 +27,7 @@ place where this information is gathered. > iommu > media/index > sysfs-platform_profile > + vduse > > .. only:: subproject and html > > diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst > new file mode 100644 > index 000000000000..2f9cd1a4e530 > --- /dev/null > +++ b/Documentation/userspace-api/vduse.rst > @@ -0,0 +1,222 @@ > +================================== > +VDUSE - "vDPA Device in Userspace" > +================================== > + > +vDPA (virtio data path acceleration) device is a device that uses a > +datapath which complies with the virtio specifications with vendor > +specific control path. vDPA devices can be both physically located on > +the hardware or emulated by software. VDUSE is a framework that makes it > +possible to implement software-emulated vDPA devices in userspace. And > +to make it simple, the emulated vDPA device's control path is handled in > +the kernel and only the data path is implemented in the userspace. > + > +Note that only virtio block device is supported by VDUSE framework now, > +which can reduce security risks when the userspace process that implements > +the data path is run by an unprivileged user. The Support for other device > +types can be added after the security issue is clarified or fixed in the future. > + > +Start/Stop VDUSE devices > +------------------------ > + > +VDUSE devices are started as follows: > + > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > + /dev/vduse/control. > + > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first > + messages will arrive while attaching the VDUSE instance to vDPA bus. > + > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE > + instance to vDPA bus. > + > +VDUSE devices are stopped as follows: > + > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE > + instance from vDPA bus. > + > +2. Close the file descriptor referring to /dev/vduse/$NAME > + > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on > + /dev/vduse/control > + > +The netlink messages metioned above can be sent via vdpa tool in iproute2 > +or use the below sample codes: > + > +.. code-block:: c > + > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) > + { > + struct nl_sock *nlsock; > + struct nl_msg *msg; > + int famid; > + > + nlsock = nl_socket_alloc(); > + if (!nlsock) > + return -ENOMEM; > + > + if (genl_connect(nlsock)) > + goto free_sock; > + > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); > + if (famid < 0) > + goto close_sock; > + > + msg = nlmsg_alloc(); > + if (!msg) > + goto close_sock; > + > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0)) > + goto nla_put_failure; > + > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); > + if (cmd == VDPA_CMD_DEV_NEW) > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse"); > + > + if (nl_send_sync(nlsock, msg)) > + goto close_sock; > + > + nl_close(nlsock); > + nl_socket_free(nlsock); > + > + return 0; > + nla_put_failure: > + nlmsg_free(msg); > + close_sock: > + nl_close(nlsock); > + free_sock: > + nl_socket_free(nlsock); > + return -1; > + } > + > +How VDUSE works > +--------------- > + > +Since the emuldated vDPA device's control path is handled in the kernel, s/emuldated/emulated/ > +a message-based communication protocol and few types of control messages > +are introduced by VDUSE framework to make userspace be aware of the data > +path related changes: > + > +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace > + > +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane > + > +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane > + > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB > + > +Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply > +those control messages from/to VDUSE kernel module as follows: > + > +.. code-block:: c > + > + static int vduse_message_handler(int dev_fd) > + { > + int len; > + struct vduse_dev_request req; > + struct vduse_dev_response resp; > + > + len = read(dev_fd, &req, sizeof(req)); > + if (len != sizeof(req)) > + return -1; > + > + resp.request_id = req.request_id; > + > + switch (req.type) { > + > + /* handle different types of message */ > + > + } > + > + if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY) > + return 0; > + > + len = write(dev_fd, &resp, sizeof(resp)); > + if (len != sizeof(resp)) > + return -1; > + > + return 0; > + } > + > +After VDUSE_START_DATAPLANE messages is received, userspace should start the > +dataplane processing with the help of some ioctls on /dev/vduse/$NAME: > + > +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region. > + Userspace can access this iova region by passing fd and corresponding size, offset, > + perm to mmap(). For example: > + > +.. code-block:: c > + > + static int perm_to_prot(uint8_t perm) > + { > + int prot = 0; > + > + switch (perm) { > + case VDUSE_ACCESS_WO: > + prot |= PROT_WRITE; > + break; > + case VDUSE_ACCESS_RO: > + prot |= PROT_READ; > + break; > + case VDUSE_ACCESS_RW: > + prot |= PROT_READ | PROT_WRITE; > + break; > + } > + > + return prot; > + } > + > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) > + { > + int fd; > + void *addr; > + size_t size; > + struct vduse_iotlb_entry entry; > + > + entry.start = iova; > + entry.last = iova + 1; Why +1? I expected the request to include *len so that VDUSE can create a bounce buffer for the full iova range, if necessary. > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry); > + if (fd < 0) > + return NULL; > + > + size = entry.last - entry.start + 1; > + *len = entry.last - iova + 1; > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED, > + fd, entry.offset); > + close(fd); > + if (addr == MAP_FAILED) > + return NULL; > + > + /* do something to cache this iova region */ How is userspace expected to manage iotlb mmaps? When should munmap(2) be called? Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of guest RAM (e.g. multiple gigabytes) that can be cached permanently or will it return just enough pages to cover [start, last)? > + > + return addr + iova - entry.start; > + } > + > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features Are these VIRTIO feature bits? Please explain how feature negotiation works. There must be a way for userspace to report the device's supported feature bits to the kernel. > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt Does this mean the contents of the configuration space are cached by VDUSE? The downside is that the userspace code cannot generate the contents on demand. Most devices doin't need to generate the contents on demand, so I think this is okay but I had expected a different interface: kernel->userspace VDUSE_DEV_GET_CONFIG userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ I think you can leave it the way it is, but I wanted to mention this in case someone thinks it's important to support generating the contents of the configuration space on demand. > +- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata > + > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used > + by VDUSE kernel module to notify userspace to consume the vring. > + > +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue This information is useful but it's not enough to be able to implement a userspace device. Please provide more developer documentation or at least refer to uapi header files, published documents, etc that contain the details.
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > > VDUSE (vDPA Device in Userspace) is a framework to support > > implementing software-emulated vDPA devices in userspace. This > > document is intended to clarify the VDUSE design and usage. > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com> > > --- > > Documentation/userspace-api/index.rst | 1 + > > Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++ > > 2 files changed, 223 insertions(+) > > create mode 100644 Documentation/userspace-api/vduse.rst > > > > diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst > > index 0b5eefed027e..c432be070f67 100644 > > --- a/Documentation/userspace-api/index.rst > > +++ b/Documentation/userspace-api/index.rst > > @@ -27,6 +27,7 @@ place where this information is gathered. > > iommu > > media/index > > sysfs-platform_profile > > + vduse > > > > .. only:: subproject and html > > > > diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst > > new file mode 100644 > > index 000000000000..2f9cd1a4e530 > > --- /dev/null > > +++ b/Documentation/userspace-api/vduse.rst > > @@ -0,0 +1,222 @@ > > +================================== > > +VDUSE - "vDPA Device in Userspace" > > +================================== > > + > > +vDPA (virtio data path acceleration) device is a device that uses a > > +datapath which complies with the virtio specifications with vendor > > +specific control path. vDPA devices can be both physically located on > > +the hardware or emulated by software. VDUSE is a framework that makes it > > +possible to implement software-emulated vDPA devices in userspace. And > > +to make it simple, the emulated vDPA device's control path is handled in > > +the kernel and only the data path is implemented in the userspace. > > + > > +Note that only virtio block device is supported by VDUSE framework now, > > +which can reduce security risks when the userspace process that implements > > +the data path is run by an unprivileged user. The Support for other device > > +types can be added after the security issue is clarified or fixed in the future. > > + > > +Start/Stop VDUSE devices > > +------------------------ > > + > > +VDUSE devices are started as follows: > > + > > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > > + /dev/vduse/control. > > + > > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first > > + messages will arrive while attaching the VDUSE instance to vDPA bus. > > + > > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE > > + instance to vDPA bus. > > + > > +VDUSE devices are stopped as follows: > > + > > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE > > + instance from vDPA bus. > > + > > +2. Close the file descriptor referring to /dev/vduse/$NAME > > + > > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on > > + /dev/vduse/control > > + > > +The netlink messages metioned above can be sent via vdpa tool in iproute2 > > +or use the below sample codes: > > + > > +.. code-block:: c > > + > > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) > > + { > > + struct nl_sock *nlsock; > > + struct nl_msg *msg; > > + int famid; > > + > > + nlsock = nl_socket_alloc(); > > + if (!nlsock) > > + return -ENOMEM; > > + > > + if (genl_connect(nlsock)) > > + goto free_sock; > > + > > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); > > + if (famid < 0) > > + goto close_sock; > > + > > + msg = nlmsg_alloc(); > > + if (!msg) > > + goto close_sock; > > + > > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0)) > > + goto nla_put_failure; > > + > > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); > > + if (cmd == VDPA_CMD_DEV_NEW) > > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse"); > > + > > + if (nl_send_sync(nlsock, msg)) > > + goto close_sock; > > + > > + nl_close(nlsock); > > + nl_socket_free(nlsock); > > + > > + return 0; > > + nla_put_failure: > > + nlmsg_free(msg); > > + close_sock: > > + nl_close(nlsock); > > + free_sock: > > + nl_socket_free(nlsock); > > + return -1; > > + } > > + > > +How VDUSE works > > +--------------- > > + > > +Since the emuldated vDPA device's control path is handled in the kernel, > > s/emuldated/emulated/ > Will fix it. > > +a message-based communication protocol and few types of control messages > > +are introduced by VDUSE framework to make userspace be aware of the data > > +path related changes: > > + > > +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace > > + > > +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane > > + > > +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane > > + > > +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB > > + > > +Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply > > +those control messages from/to VDUSE kernel module as follows: > > + > > +.. code-block:: c > > + > > + static int vduse_message_handler(int dev_fd) > > + { > > + int len; > > + struct vduse_dev_request req; > > + struct vduse_dev_response resp; > > + > > + len = read(dev_fd, &req, sizeof(req)); > > + if (len != sizeof(req)) > > + return -1; > > + > > + resp.request_id = req.request_id; > > + > > + switch (req.type) { > > + > > + /* handle different types of message */ > > + > > + } > > + > > + if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY) > > + return 0; > > + > > + len = write(dev_fd, &resp, sizeof(resp)); > > + if (len != sizeof(resp)) > > + return -1; > > + > > + return 0; > > + } > > + > > +After VDUSE_START_DATAPLANE messages is received, userspace should start the > > +dataplane processing with the help of some ioctls on /dev/vduse/$NAME: > > + > > +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region. > > + Userspace can access this iova region by passing fd and corresponding size, offset, > > + perm to mmap(). For example: > > + > > +.. code-block:: c > > + > > + static int perm_to_prot(uint8_t perm) > > + { > > + int prot = 0; > > + > > + switch (perm) { > > + case VDUSE_ACCESS_WO: > > + prot |= PROT_WRITE; > > + break; > > + case VDUSE_ACCESS_RO: > > + prot |= PROT_READ; > > + break; > > + case VDUSE_ACCESS_RW: > > + prot |= PROT_READ | PROT_WRITE; > > + break; > > + } > > + > > + return prot; > > + } > > + > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) > > + { > > + int fd; > > + void *addr; > > + size_t size; > > + struct vduse_iotlb_entry entry; > > + > > + entry.start = iova; > > + entry.last = iova + 1; > > Why +1? > > I expected the request to include *len so that VDUSE can create a bounce > buffer for the full iova range, if necessary. > The function is used to translate iova to va. And the *len is not specified by the caller. Instead, it's used to tell the caller the length of the contiguous iova region from the specified iova. And the ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first overlapped iova region. So using iova + 1 should be enough here. > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry); > > + if (fd < 0) > > + return NULL; > > + > > + size = entry.last - entry.start + 1; > > + *len = entry.last - iova + 1; > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED, > > + fd, entry.offset); > > + close(fd); > > + if (addr == MAP_FAILED) > > + return NULL; > > + > > + /* do something to cache this iova region */ > > How is userspace expected to manage iotlb mmaps? When should munmap(2) > be called? > The simple way is using a list to store the iotlb mappings. And we should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB or VDUSE_STOP_DATAPLANE message is received. > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of > guest RAM (e.g. multiple gigabytes) that can be cached permanently or > will it return just enough pages to cover [start, last)? > It should return one iotlb mapping that covers [start, last). In vhost-vdpa cases, it might be a full chunk of guest RAM. In virtio-vdpa cases, it might be the whole bounce buffer or one coherent mapping (produced by dma_alloc_coherent()). > > + > > + return addr + iova - entry.start; > > + } > > + > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features > > Are these VIRTIO feature bits? Please explain how feature negotiation > works. There must be a way for userspace to report the device's > supported feature bits to the kernel. > Yes, these are VIRTIO feature bits. Userspace will specify the device's supported feature bits when creating a new VDUSE device with ioctl(VDUSE_CREATE_DEV). > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt > > Does this mean the contents of the configuration space are cached by > VDUSE? Yes, but the kernel will also store the same contents. > The downside is that the userspace code cannot generate the > contents on demand. Most devices doin't need to generate the contents > on demand, so I think this is okay but I had expected a different > interface: > > kernel->userspace VDUSE_DEV_GET_CONFIG > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We will need lots of modification of virtio codes to support that. So to make it simple, we choose this way: userspace -> kernel VDUSE_DEV_SET_CONFIG userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ > I think you can leave it the way it is, but I wanted to mention this in > case someone thinks it's important to support generating the contents of > the configuration space on demand. > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and VDUSE_DEV_INJECT_CONFIG_IRQ achieve that? > > +- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata > > + > > +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used > > + by VDUSE kernel module to notify userspace to consume the vring. > > + > > +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue > > This information is useful but it's not enough to be able to implement a > userspace device. Please provide more developer documentation or at > least refer to uapi header files, published documents, etc that contain > the details. OK, I will try to add more details. Thanks, Yongji
On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote: > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) > > > + { > > > + int fd; > > > + void *addr; > > > + size_t size; > > > + struct vduse_iotlb_entry entry; > > > + > > > + entry.start = iova; > > > + entry.last = iova + 1; > > > > Why +1? > > > > I expected the request to include *len so that VDUSE can create a bounce > > buffer for the full iova range, if necessary. > > > > The function is used to translate iova to va. And the *len is not > specified by the caller. Instead, it's used to tell the caller the > length of the contiguous iova region from the specified iova. And the > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first > overlapped iova region. So using iova + 1 should be enough here. Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I wonder why userspace needs to assign a value at all if it's always +1. > > > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry); > > > + if (fd < 0) > > > + return NULL; > > > + > > > + size = entry.last - entry.start + 1; > > > + *len = entry.last - iova + 1; > > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED, > > > + fd, entry.offset); > > > + close(fd); > > > + if (addr == MAP_FAILED) > > > + return NULL; > > > + > > > + /* do something to cache this iova region */ > > > > How is userspace expected to manage iotlb mmaps? When should munmap(2) > > be called? > > > > The simple way is using a list to store the iotlb mappings. And we > should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB > or VDUSE_STOP_DATAPLANE message is received. Thanks for explaining. It would be helpful to have a description of IOTLB operation in this document. > > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of > > guest RAM (e.g. multiple gigabytes) that can be cached permanently or > > will it return just enough pages to cover [start, last)? > > > > It should return one iotlb mapping that covers [start, last). In > vhost-vdpa cases, it might be a full chunk of guest RAM. In > virtio-vdpa cases, it might be the whole bounce buffer or one coherent > mapping (produced by dma_alloc_coherent()). Great, thanks. Adding something about this to the documentation would help others implementing VDUSE devices or libraries. > > > + > > > + return addr + iova - entry.start; > > > + } > > > + > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features > > > > Are these VIRTIO feature bits? Please explain how feature negotiation > > works. There must be a way for userspace to report the device's > > supported feature bits to the kernel. > > > > Yes, these are VIRTIO feature bits. Userspace will specify the > device's supported feature bits when creating a new VDUSE device with > ioctl(VDUSE_CREATE_DEV). Can the VDUSE device influence feature bit negotiation? For example, if the VDUSE virtio-blk device does not implement discard/write-zeroes, how does QEMU or the guest find out about this? > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt > > > > Does this mean the contents of the configuration space are cached by > > VDUSE? > > Yes, but the kernel will also store the same contents. > > > The downside is that the userspace code cannot generate the > > contents on demand. Most devices doin't need to generate the contents > > on demand, so I think this is okay but I had expected a different > > interface: > > > > kernel->userspace VDUSE_DEV_GET_CONFIG > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We > will need lots of modification of virtio codes to support that. So to > make it simple, we choose this way: > > userspace -> kernel VDUSE_DEV_SET_CONFIG > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > I think you can leave it the way it is, but I wanted to mention this in > > case someone thinks it's important to support generating the contents of > > the configuration space on demand. > > > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that? If the contents of the configuration space change continuously, then the VDUSE_DEV_SET_CONFIG approach is inefficient and might have race conditions. For example, imagine a device where the driver can read a timer from the configuration space. I think the VIRTIO device model allows that although I'm not aware of any devices that do something like it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be called frequently to keep the timer value updated even though the guest driver probably isn't accessing it. What's worse is that there might be race conditions where other driver->device operations are supposed to update the configuration space but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an outdated copy. Again, I don't think it's a problem for existing devices in the VIRTIO specification. But I'm not 100% sure and future devices might require what I've described, so the VDUSE_DEV_SET_CONFIG interface could become a problem. Stefan
On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote: > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) > > > > + { > > > > + int fd; > > > > + void *addr; > > > > + size_t size; > > > > + struct vduse_iotlb_entry entry; > > > > + > > > > + entry.start = iova; > > > > + entry.last = iova + 1; > > > > > > Why +1? > > > > > > I expected the request to include *len so that VDUSE can create a bounce > > > buffer for the full iova range, if necessary. > > > > > > > The function is used to translate iova to va. And the *len is not > > specified by the caller. Instead, it's used to tell the caller the > > length of the contiguous iova region from the specified iova. And the > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first > > overlapped iova region. So using iova + 1 should be enough here. > > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I > wonder why userspace needs to assign a value at all if it's always +1. > If we need to get some iova regions in the specified range, we need the entry.last field. For example, we can use [0, ULONG_MAX] to get the first overlapped iova region which might be [4096, 8192]. But in this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to get the iova region including the specified iova. > > > > > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry); > > > > + if (fd < 0) > > > > + return NULL; > > > > + > > > > + size = entry.last - entry.start + 1; > > > > + *len = entry.last - iova + 1; > > > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED, > > > > + fd, entry.offset); > > > > + close(fd); > > > > + if (addr == MAP_FAILED) > > > > + return NULL; > > > > + > > > > + /* do something to cache this iova region */ > > > > > > How is userspace expected to manage iotlb mmaps? When should munmap(2) > > > be called? > > > > > > > The simple way is using a list to store the iotlb mappings. And we > > should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB > > or VDUSE_STOP_DATAPLANE message is received. > > Thanks for explaining. It would be helpful to have a description of > IOTLB operation in this document. > Sure. > > > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of > > > guest RAM (e.g. multiple gigabytes) that can be cached permanently or > > > will it return just enough pages to cover [start, last)? > > > > > > > It should return one iotlb mapping that covers [start, last). In > > vhost-vdpa cases, it might be a full chunk of guest RAM. In > > virtio-vdpa cases, it might be the whole bounce buffer or one coherent > > mapping (produced by dma_alloc_coherent()). > > Great, thanks. Adding something about this to the documentation would > help others implementing VDUSE devices or libraries. > OK. > > > > + > > > > + return addr + iova - entry.start; > > > > + } > > > > + > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features > > > > > > Are these VIRTIO feature bits? Please explain how feature negotiation > > > works. There must be a way for userspace to report the device's > > > supported feature bits to the kernel. > > > > > > > Yes, these are VIRTIO feature bits. Userspace will specify the > > device's supported feature bits when creating a new VDUSE device with > > ioctl(VDUSE_CREATE_DEV). > > Can the VDUSE device influence feature bit negotiation? For example, if > the VDUSE virtio-blk device does not implement discard/write-zeroes, how > does QEMU or the guest find out about this? > There is a "features" field in struct vduse_dev_config which is used to do feature negotiation. > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt > > > > > > Does this mean the contents of the configuration space are cached by > > > VDUSE? > > > > Yes, but the kernel will also store the same contents. > > > > > The downside is that the userspace code cannot generate the > > > contents on demand. Most devices doin't need to generate the contents > > > on demand, so I think this is okay but I had expected a different > > > interface: > > > > > > kernel->userspace VDUSE_DEV_GET_CONFIG > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We > > will need lots of modification of virtio codes to support that. So to > > make it simple, we choose this way: > > > > userspace -> kernel VDUSE_DEV_SET_CONFIG > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > > I think you can leave it the way it is, but I wanted to mention this in > > > case someone thinks it's important to support generating the contents of > > > the configuration space on demand. > > > > > > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that? > > If the contents of the configuration space change continuously, then the > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race > conditions. For example, imagine a device where the driver can read a > timer from the configuration space. I think the VIRTIO device model > allows that although I'm not aware of any devices that do something like > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be > called frequently to keep the timer value updated even though the guest > driver probably isn't accessing it. > OK, I get you now. Since the VIRTIO specification says "Device configuration space is generally used for rarely-changing or initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG ioctl should not be called frequently. > What's worse is that there might be race conditions where other > driver->device operations are supposed to update the configuration space > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an > outdated copy. > I'm not sure. Should the device and driver be able to access the same fields concurrently? > Again, I don't think it's a problem for existing devices in the VIRTIO > specification. But I'm not 100% sure and future devices might require > what I've described, so the VDUSE_DEV_SET_CONFIG interface could become > a problem. > If so, maybe a new interface can be added at that time. The VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good way for failure handling. Thanks, Yongji
On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote: > On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote: > > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) > > > > > + { > > > > > + int fd; > > > > > + void *addr; > > > > > + size_t size; > > > > > + struct vduse_iotlb_entry entry; > > > > > + > > > > > + entry.start = iova; > > > > > + entry.last = iova + 1; > > > > > > > > Why +1? > > > > > > > > I expected the request to include *len so that VDUSE can create a bounce > > > > buffer for the full iova range, if necessary. > > > > > > > > > > The function is used to translate iova to va. And the *len is not > > > specified by the caller. Instead, it's used to tell the caller the > > > length of the contiguous iova region from the specified iova. And the > > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first > > > overlapped iova region. So using iova + 1 should be enough here. > > > > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I > > wonder why userspace needs to assign a value at all if it's always +1. > > > > If we need to get some iova regions in the specified range, we need > the entry.last field. For example, we can use [0, ULONG_MAX] to get > the first overlapped iova region which might be [4096, 8192]. But in > this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to > get the iova region including the specified iova. I see, thanks for explaining! > > > > > + return addr + iova - entry.start; > > > > > + } > > > > > + > > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features > > > > > > > > Are these VIRTIO feature bits? Please explain how feature negotiation > > > > works. There must be a way for userspace to report the device's > > > > supported feature bits to the kernel. > > > > > > > > > > Yes, these are VIRTIO feature bits. Userspace will specify the > > > device's supported feature bits when creating a new VDUSE device with > > > ioctl(VDUSE_CREATE_DEV). > > > > Can the VDUSE device influence feature bit negotiation? For example, if > > the VDUSE virtio-blk device does not implement discard/write-zeroes, how > > does QEMU or the guest find out about this? > > > > There is a "features" field in struct vduse_dev_config which is used > to do feature negotiation. This approach is more restrictive than required by the VIRTIO specification: "The device SHOULD accept any valid subset of features the driver accepts, otherwise it MUST fail to set the FEATURES_OK device status bit when the driver writes it." https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002 The spec allows a device to reject certain subsets of features. For example, if feature B depends on feature A and can only be enabled when feature A is also enabled. From your description I think VDUSE would accept feature B without feature A since the device implementation has no opportunity to fail negotiation with custom logic. Ideally VDUSE would send a SET_FEATURES message to userspace, allowing the device implementation full flexibility in which subsets of features to accept. This is a corner case. Many or maybe even all existing VIRTIO devices don't need this flexibility, but I want to point out this limitation in the VDUSE interface because it may cause issues in the future. > > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt > > > > > > > > Does this mean the contents of the configuration space are cached by > > > > VDUSE? > > > > > > Yes, but the kernel will also store the same contents. > > > > > > > The downside is that the userspace code cannot generate the > > > > contents on demand. Most devices doin't need to generate the contents > > > > on demand, so I think this is okay but I had expected a different > > > > interface: > > > > > > > > kernel->userspace VDUSE_DEV_GET_CONFIG > > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > > > > > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We > > > will need lots of modification of virtio codes to support that. So to > > > make it simple, we choose this way: > > > > > > userspace -> kernel VDUSE_DEV_SET_CONFIG > > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > > > > I think you can leave it the way it is, but I wanted to mention this in > > > > case someone thinks it's important to support generating the contents of > > > > the configuration space on demand. > > > > > > > > > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and > > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that? > > > > If the contents of the configuration space change continuously, then the > > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race > > conditions. For example, imagine a device where the driver can read a > > timer from the configuration space. I think the VIRTIO device model > > allows that although I'm not aware of any devices that do something like > > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be > > called frequently to keep the timer value updated even though the guest > > driver probably isn't accessing it. > > > > OK, I get you now. Since the VIRTIO specification says "Device > configuration space is generally used for rarely-changing or > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > ioctl should not be called frequently. The spec uses MUST and other terms to define the precise requirements. Here the language (especially the word "generally") is weaker and means there may be exceptions. Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG approach is reads that have side-effects. For example, imagine a field containing an error code if the device encounters a problem unrelated to a specific virtqueue request. Reading from this field resets the error code to 0, saving the driver an extra configuration space write access and possibly race conditions. It isn't possible to implement those semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it makes me think that the interface does not allow full VIRTIO semantics. > > What's worse is that there might be race conditions where other > > driver->device operations are supposed to update the configuration space > > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an > > outdated copy. > > > > I'm not sure. Should the device and driver be able to access the same > fields concurrently? Yes. The VIRTIO spec has a generation count to handle multi-field accesses so that consistency can be ensured: https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004 > > > Again, I don't think it's a problem for existing devices in the VIRTIO > > specification. But I'm not 100% sure and future devices might require > > what I've described, so the VDUSE_DEV_SET_CONFIG interface could become > > a problem. > > > > If so, maybe a new interface can be added at that time. The > VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good > way for failure handling. I'm not aware of the details of why the current approach was necessary, so I don't have any concrete suggestions. Sorry! Stefan
On Thu, Jul 1, 2021 at 9:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote: > > On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote: > > > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > > > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) > > > > > > + { > > > > > > + int fd; > > > > > > + void *addr; > > > > > > + size_t size; > > > > > > + struct vduse_iotlb_entry entry; > > > > > > + > > > > > > + entry.start = iova; > > > > > > + entry.last = iova + 1; > > > > > > > > > > Why +1? > > > > > > > > > > I expected the request to include *len so that VDUSE can create a bounce > > > > > buffer for the full iova range, if necessary. > > > > > > > > > > > > > The function is used to translate iova to va. And the *len is not > > > > specified by the caller. Instead, it's used to tell the caller the > > > > length of the contiguous iova region from the specified iova. And the > > > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first > > > > overlapped iova region. So using iova + 1 should be enough here. > > > > > > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I > > > wonder why userspace needs to assign a value at all if it's always +1. > > > > > > > If we need to get some iova regions in the specified range, we need > > the entry.last field. For example, we can use [0, ULONG_MAX] to get > > the first overlapped iova region which might be [4096, 8192]. But in > > this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to > > get the iova region including the specified iova. > > I see, thanks for explaining! > > > > > > > + return addr + iova - entry.start; > > > > > > + } > > > > > > + > > > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features > > > > > > > > > > Are these VIRTIO feature bits? Please explain how feature negotiation > > > > > works. There must be a way for userspace to report the device's > > > > > supported feature bits to the kernel. > > > > > > > > > > > > > Yes, these are VIRTIO feature bits. Userspace will specify the > > > > device's supported feature bits when creating a new VDUSE device with > > > > ioctl(VDUSE_CREATE_DEV). > > > > > > Can the VDUSE device influence feature bit negotiation? For example, if > > > the VDUSE virtio-blk device does not implement discard/write-zeroes, how > > > does QEMU or the guest find out about this? > > > > > > > There is a "features" field in struct vduse_dev_config which is used > > to do feature negotiation. > > This approach is more restrictive than required by the VIRTIO > specification: > > "The device SHOULD accept any valid subset of features the driver > accepts, otherwise it MUST fail to set the FEATURES_OK device status > bit when the driver writes it." > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002 > > The spec allows a device to reject certain subsets of features. For > example, if feature B depends on feature A and can only be enabled when > feature A is also enabled. > > From your description I think VDUSE would accept feature B without > feature A since the device implementation has no opportunity to fail > negotiation with custom logic. > Yes, we discussed it [1] before. So I'd like to re-introduce SET_STATUS messages so that the userspace can fail feature negotiation during setting FEATURES_OK status bit. [1] https://lkml.org/lkml/2021/6/28/1587 > Ideally VDUSE would send a SET_FEATURES message to userspace, allowing > the device implementation full flexibility in which subsets of features > to accept. > > This is a corner case. Many or maybe even all existing VIRTIO devices > don't need this flexibility, but I want to point out this limitation in > the VDUSE interface because it may cause issues in the future. > > > > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt > > > > > > > > > > Does this mean the contents of the configuration space are cached by > > > > > VDUSE? > > > > > > > > Yes, but the kernel will also store the same contents. > > > > > > > > > The downside is that the userspace code cannot generate the > > > > > contents on demand. Most devices doin't need to generate the contents > > > > > on demand, so I think this is okay but I had expected a different > > > > > interface: > > > > > > > > > > kernel->userspace VDUSE_DEV_GET_CONFIG > > > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > > > > > > > > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We > > > > will need lots of modification of virtio codes to support that. So to > > > > make it simple, we choose this way: > > > > > > > > userspace -> kernel VDUSE_DEV_SET_CONFIG > > > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ > > > > > > > > > I think you can leave it the way it is, but I wanted to mention this in > > > > > case someone thinks it's important to support generating the contents of > > > > > the configuration space on demand. > > > > > > > > > > > > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and > > > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that? > > > > > > If the contents of the configuration space change continuously, then the > > > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race > > > conditions. For example, imagine a device where the driver can read a > > > timer from the configuration space. I think the VIRTIO device model > > > allows that although I'm not aware of any devices that do something like > > > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be > > > called frequently to keep the timer value updated even though the guest > > > driver probably isn't accessing it. > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > configuration space is generally used for rarely-changing or > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > ioctl should not be called frequently. > > The spec uses MUST and other terms to define the precise requirements. > Here the language (especially the word "generally") is weaker and means > there may be exceptions. > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > approach is reads that have side-effects. For example, imagine a field > containing an error code if the device encounters a problem unrelated to > a specific virtqueue request. Reading from this field resets the error > code to 0, saving the driver an extra configuration space write access > and possibly race conditions. It isn't possible to implement those > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > makes me think that the interface does not allow full VIRTIO semantics. > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to handle the message failure, I'm going to add a return value to virtio_config_ops.get() and virtio_cread_* API so that the error can be propagated to the virtio device driver. Then the virtio-blk device driver can be modified to handle that. Jason and Stefan, what do you think of this way? > > > What's worse is that there might be race conditions where other > > > driver->device operations are supposed to update the configuration space > > > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an > > > outdated copy. > > > > > > > I'm not sure. Should the device and driver be able to access the same > > fields concurrently? > > Yes. The VIRTIO spec has a generation count to handle multi-field > accesses so that consistency can be ensured: > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004 > I see. Thanks, Yongji
在 2021/7/4 下午5:49, Yongji Xie 写道: >>> OK, I get you now. Since the VIRTIO specification says "Device >>> configuration space is generally used for rarely-changing or >>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG >>> ioctl should not be called frequently. >> The spec uses MUST and other terms to define the precise requirements. >> Here the language (especially the word "generally") is weaker and means >> there may be exceptions. >> >> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG >> approach is reads that have side-effects. For example, imagine a field >> containing an error code if the device encounters a problem unrelated to >> a specific virtqueue request. Reading from this field resets the error >> code to 0, saving the driver an extra configuration space write access >> and possibly race conditions. It isn't possible to implement those >> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it >> makes me think that the interface does not allow full VIRTIO semantics. Note that though you're correct, my understanding is that config space is not suitable for this kind of error propagating. And it would be very hard to implement such kind of semantic in some transports. Virtqueue should be much better. As Yong Ji quoted, the config space is used for "rarely-changing or intialization-time parameters". > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > handle the message failure, I'm going to add a return value to > virtio_config_ops.get() and virtio_cread_* API so that the error can > be propagated to the virtio device driver. Then the virtio-blk device > driver can be modified to handle that. > > Jason and Stefan, what do you think of this way? I'd like to stick to the current assumption thich get_config won't fail. That is to say, 1) maintain a config in the kernel, make sure the config space read can always succeed 2) introduce an ioctl for the vduse usersapce to update the config space. 3) we can synchronize with the vduse userspace during set_config Does this work? Thanks >
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > configuration space is generally used for rarely-changing or > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > ioctl should not be called frequently. > > > The spec uses MUST and other terms to define the precise requirements. > > > Here the language (especially the word "generally") is weaker and means > > > there may be exceptions. > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > approach is reads that have side-effects. For example, imagine a field > > > containing an error code if the device encounters a problem unrelated to > > > a specific virtqueue request. Reading from this field resets the error > > > code to 0, saving the driver an extra configuration space write access > > > and possibly race conditions. It isn't possible to implement those > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > makes me think that the interface does not allow full VIRTIO semantics. > > > Note that though you're correct, my understanding is that config space is > not suitable for this kind of error propagating. And it would be very hard > to implement such kind of semantic in some transports. Virtqueue should be > much better. As Yong Ji quoted, the config space is used for > "rarely-changing or intialization-time parameters". > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > handle the message failure, I'm going to add a return value to > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > be propagated to the virtio device driver. Then the virtio-blk device > > driver can be modified to handle that. > > > > Jason and Stefan, what do you think of this way? Why does VDUSE_DEV_GET_CONFIG need to support an error return value? The VIRTIO spec provides no way for the device to report errors from config space accesses. The QEMU virtio-pci implementation returns -1 from invalid virtio_config_read*() and silently discards virtio_config_write*() accesses. VDUSE can take the same approach with VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > I'd like to stick to the current assumption thich get_config won't fail. > That is to say, > > 1) maintain a config in the kernel, make sure the config space read can > always succeed > 2) introduce an ioctl for the vduse usersapce to update the config space. > 3) we can synchronize with the vduse userspace during set_config > > Does this work? I noticed that caching is also allowed by the vhost-user protocol messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't know whether or not caching is in effect. The interface you outlined above requires caching. Is there a reason why the host kernel vDPA code needs to cache the configuration space? Here are the vhost-user protocol messages: Virtio device config space ^^^^^^^^^^^^^^^^^^^^^^^^^^ +--------+------+-------+---------+ | offset | size | flags | payload | +--------+------+-------+---------+ :offset: a 32-bit offset of virtio device's configuration space :size: a 32-bit configuration space access size in bytes :flags: a 32-bit value: - 0: Vhost master messages used for writeable fields - 1: Vhost master messages used for live migration :payload: Size bytes array holding the contents of the virtio device's configuration space ... ``VHOST_USER_GET_CONFIG`` :id: 24 :equivalent ioctl: N/A :master payload: virtio device config space :slave payload: virtio device config space When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is submitted by the vhost-user master to fetch the contents of the virtio device configuration space, vhost-user slave's payload size MUST match master's request, vhost-user slave uses zero length of payload to indicate an error to vhost-user master. The vhost-user master may cache the contents to avoid repeated ``VHOST_USER_GET_CONFIG`` calls. ``VHOST_USER_SET_CONFIG`` :id: 25 :equivalent ioctl: N/A :master payload: virtio device config space :slave payload: N/A When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is submitted by the vhost-user master when the Guest changes the virtio device configuration space and also can be used for live migration on the destination host. The vhost-user slave must check the flags field, and slaves MUST NOT accept SET_CONFIG for read-only configuration space fields unless the live migration bit is set. Stefan
在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: >> 在 2021/7/4 下午5:49, Yongji Xie 写道: >>>>> OK, I get you now. Since the VIRTIO specification says "Device >>>>> configuration space is generally used for rarely-changing or >>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG >>>>> ioctl should not be called frequently. >>>> The spec uses MUST and other terms to define the precise requirements. >>>> Here the language (especially the word "generally") is weaker and means >>>> there may be exceptions. >>>> >>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG >>>> approach is reads that have side-effects. For example, imagine a field >>>> containing an error code if the device encounters a problem unrelated to >>>> a specific virtqueue request. Reading from this field resets the error >>>> code to 0, saving the driver an extra configuration space write access >>>> and possibly race conditions. It isn't possible to implement those >>>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it >>>> makes me think that the interface does not allow full VIRTIO semantics. >> >> Note that though you're correct, my understanding is that config space is >> not suitable for this kind of error propagating. And it would be very hard >> to implement such kind of semantic in some transports. Virtqueue should be >> much better. As Yong Ji quoted, the config space is used for >> "rarely-changing or intialization-time parameters". >> >> >>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to >>> handle the message failure, I'm going to add a return value to >>> virtio_config_ops.get() and virtio_cread_* API so that the error can >>> be propagated to the virtio device driver. Then the virtio-blk device >>> driver can be modified to handle that. >>> >>> Jason and Stefan, what do you think of this way? > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > The VIRTIO spec provides no way for the device to report errors from > config space accesses. > > The QEMU virtio-pci implementation returns -1 from invalid > virtio_config_read*() and silently discards virtio_config_write*() > accesses. > > VDUSE can take the same approach with > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > >> I'd like to stick to the current assumption thich get_config won't fail. >> That is to say, >> >> 1) maintain a config in the kernel, make sure the config space read can >> always succeed >> 2) introduce an ioctl for the vduse usersapce to update the config space. >> 3) we can synchronize with the vduse userspace during set_config >> >> Does this work? > I noticed that caching is also allowed by the vhost-user protocol > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't > know whether or not caching is in effect. The interface you outlined > above requires caching. > > Is there a reason why the host kernel vDPA code needs to cache the > configuration space? Because: 1) Kernel can not wait forever in get_config(), this is the major difference with vhost-user. 2) Stick to the current assumption that virtio_cread() should always succeed. Thanks > > Here are the vhost-user protocol messages: > > Virtio device config space > ^^^^^^^^^^^^^^^^^^^^^^^^^^ > > +--------+------+-------+---------+ > | offset | size | flags | payload | > +--------+------+-------+---------+ > > :offset: a 32-bit offset of virtio device's configuration space > > :size: a 32-bit configuration space access size in bytes > > :flags: a 32-bit value: > - 0: Vhost master messages used for writeable fields > - 1: Vhost master messages used for live migration > > :payload: Size bytes array holding the contents of the virtio > device's configuration space > > ... > > ``VHOST_USER_GET_CONFIG`` > :id: 24 > :equivalent ioctl: N/A > :master payload: virtio device config space > :slave payload: virtio device config space > > When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is > submitted by the vhost-user master to fetch the contents of the > virtio device configuration space, vhost-user slave's payload size > MUST match master's request, vhost-user slave uses zero length of > payload to indicate an error to vhost-user master. The vhost-user > master may cache the contents to avoid repeated > ``VHOST_USER_GET_CONFIG`` calls. > > ``VHOST_USER_SET_CONFIG`` > :id: 25 > :equivalent ioctl: N/A > :master payload: virtio device config space > :slave payload: N/A > > When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is > submitted by the vhost-user master when the Guest changes the virtio > device configuration space and also can be used for live migration > on the destination host. The vhost-user slave must check the flags > field, and slaves MUST NOT accept SET_CONFIG for read-only > configuration space fields unless the live migration bit is set. > > Stefan
On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > configuration space is generally used for rarely-changing or > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > ioctl should not be called frequently. > > > > The spec uses MUST and other terms to define the precise requirements. > > > > Here the language (especially the word "generally") is weaker and means > > > > there may be exceptions. > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > approach is reads that have side-effects. For example, imagine a field > > > > containing an error code if the device encounters a problem unrelated to > > > > a specific virtqueue request. Reading from this field resets the error > > > > code to 0, saving the driver an extra configuration space write access > > > > and possibly race conditions. It isn't possible to implement those > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > Note that though you're correct, my understanding is that config space is > > not suitable for this kind of error propagating. And it would be very hard > > to implement such kind of semantic in some transports. Virtqueue should be > > much better. As Yong Ji quoted, the config space is used for > > "rarely-changing or intialization-time parameters". > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > handle the message failure, I'm going to add a return value to > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > be propagated to the virtio device driver. Then the virtio-blk device > > > driver can be modified to handle that. > > > > > > Jason and Stefan, what do you think of this way? > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > We add a timeout and return error in case userspace never replies to the message. > The VIRTIO spec provides no way for the device to report errors from > config space accesses. > > The QEMU virtio-pci implementation returns -1 from invalid > virtio_config_read*() and silently discards virtio_config_write*() > accesses. > > VDUSE can take the same approach with > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > I noticed that virtio_config_read*() only returns -1 when we access a invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail when we access a valid field. Not sure if it's ok to silently ignore this kind of error. Thanks, Yongji
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote: > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > configuration space is generally used for rarely-changing or > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > ioctl should not be called frequently. > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > Here the language (especially the word "generally") is weaker and means > > > > > there may be exceptions. > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > containing an error code if the device encounters a problem unrelated to > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > code to 0, saving the driver an extra configuration space write access > > > > > and possibly race conditions. It isn't possible to implement those > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > Note that though you're correct, my understanding is that config space is > > > not suitable for this kind of error propagating. And it would be very hard > > > to implement such kind of semantic in some transports. Virtqueue should be > > > much better. As Yong Ji quoted, the config space is used for > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > handle the message failure, I'm going to add a return value to > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > driver can be modified to handle that. > > > > > > > > Jason and Stefan, what do you think of this way? > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > The VIRTIO spec provides no way for the device to report errors from > > config space accesses. > > > > The QEMU virtio-pci implementation returns -1 from invalid > > virtio_config_read*() and silently discards virtio_config_write*() > > accesses. > > > > VDUSE can take the same approach with > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > > I'd like to stick to the current assumption thich get_config won't fail. > > > That is to say, > > > > > > 1) maintain a config in the kernel, make sure the config space read can > > > always succeed > > > 2) introduce an ioctl for the vduse usersapce to update the config space. > > > 3) we can synchronize with the vduse userspace during set_config > > > > > > Does this work? > > I noticed that caching is also allowed by the vhost-user protocol > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't > > know whether or not caching is in effect. The interface you outlined > > above requires caching. > > > > Is there a reason why the host kernel vDPA code needs to cache the > > configuration space? > > > Because: > > 1) Kernel can not wait forever in get_config(), this is the major difference > with vhost-user. virtio_cread() can sleep: #define virtio_cread(vdev, structname, member, ptr) \ do { \ typeof(((structname*)0)->member) virtio_cread_v; \ \ might_sleep(); \ ^^^^^^^^^^^^^^ Which code path cannot sleep? > 2) Stick to the current assumption that virtio_cread() should always > succeed. That can be done by reading -1 (like QEMU does) when the read fails. Stefan
On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote: > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > configuration space is generally used for rarely-changing or > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > ioctl should not be called frequently. > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > Here the language (especially the word "generally") is weaker and means > > > > > there may be exceptions. > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > containing an error code if the device encounters a problem unrelated to > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > code to 0, saving the driver an extra configuration space write access > > > > > and possibly race conditions. It isn't possible to implement those > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > > > > Note that though you're correct, my understanding is that config space is > > > not suitable for this kind of error propagating. And it would be very hard > > > to implement such kind of semantic in some transports. Virtqueue should be > > > much better. As Yong Ji quoted, the config space is used for > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > handle the message failure, I'm going to add a return value to > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > driver can be modified to handle that. > > > > > > > > Jason and Stefan, what do you think of this way? > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > We add a timeout and return error in case userspace never replies to > the message. > > > The VIRTIO spec provides no way for the device to report errors from > > config space accesses. > > > > The QEMU virtio-pci implementation returns -1 from invalid > > virtio_config_read*() and silently discards virtio_config_write*() > > accesses. > > > > VDUSE can take the same approach with > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > I noticed that virtio_config_read*() only returns -1 when we access a > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail > when we access a valid field. Not sure if it's ok to silently ignore > this kind of error. That's a good point but it's a general VIRTIO issue. Any device implementation (QEMU userspace, hardware vDPA, etc) can fail, so the VIRTIO specification needs to provide a way for the driver to detect this. If userspace violates the contract then VDUSE needs to mark the device broken. QEMU's device emulation does something similar with the vdev->broken flag. The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by vDPA/VDUSE to indicate that the device is not operational and must be reset. The driver code may still process the -1 value read from the configuration space. Hopefully this isn't a problem. There is currently no VIRTIO interface besides DEVICE_NEEDS_RESET to indicate configuration space access failures. On the other hand, drivers need to handle malicious devices so they should be able to cope with the -1 value anyway. Stefan
On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote: > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > > configuration space is generally used for rarely-changing or > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > > ioctl should not be called frequently. > > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > > Here the language (especially the word "generally") is weaker and means > > > > > > there may be exceptions. > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > > containing an error code if the device encounters a problem unrelated to > > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > > code to 0, saving the driver an extra configuration space write access > > > > > > and possibly race conditions. It isn't possible to implement those > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > > > > > > > Note that though you're correct, my understanding is that config space is > > > > not suitable for this kind of error propagating. And it would be very hard > > > > to implement such kind of semantic in some transports. Virtqueue should be > > > > much better. As Yong Ji quoted, the config space is used for > > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > > handle the message failure, I'm going to add a return value to > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > > driver can be modified to handle that. > > > > > > > > > > Jason and Stefan, what do you think of this way? > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > > > > We add a timeout and return error in case userspace never replies to > > the message. > > > > > The VIRTIO spec provides no way for the device to report errors from > > > config space accesses. > > > > > > The QEMU virtio-pci implementation returns -1 from invalid > > > virtio_config_read*() and silently discards virtio_config_write*() > > > accesses. > > > > > > VDUSE can take the same approach with > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > > > > I noticed that virtio_config_read*() only returns -1 when we access a > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail > > when we access a valid field. Not sure if it's ok to silently ignore > > this kind of error. > > That's a good point but it's a general VIRTIO issue. Any device > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the > VIRTIO specification needs to provide a way for the driver to detect > this. > > If userspace violates the contract then VDUSE needs to mark the device > broken. QEMU's device emulation does something similar with the > vdev->broken flag. > > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by > vDPA/VDUSE to indicate that the device is not operational and must be > reset. > It might be a solution. But DEVICE_NEEDS_RESET is not implemented currently. So I'm thinking whether it's ok to add a check of DEVICE_NEEDS_RESET status bit in probe function of virtio device driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail device initailization when configuration space access failed. Thanks, Yongji
在 2021/7/7 下午4:55, Stefan Hajnoczi 写道: > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote: >> 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道: >>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote: >>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote: >>>>>> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: >>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: >>>>>>>> 在 2021/7/4 下午5:49, Yongji Xie 写道: >>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device >>>>>>>>>>> configuration space is generally used for rarely-changing or >>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG >>>>>>>>>>> ioctl should not be called frequently. >>>>>>>>>> The spec uses MUST and other terms to define the precise requirements. >>>>>>>>>> Here the language (especially the word "generally") is weaker and means >>>>>>>>>> there may be exceptions. >>>>>>>>>> >>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG >>>>>>>>>> approach is reads that have side-effects. For example, imagine a field >>>>>>>>>> containing an error code if the device encounters a problem unrelated to >>>>>>>>>> a specific virtqueue request. Reading from this field resets the error >>>>>>>>>> code to 0, saving the driver an extra configuration space write access >>>>>>>>>> and possibly race conditions. It isn't possible to implement those >>>>>>>>>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it >>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics. >>>>>>>> Note that though you're correct, my understanding is that config space is >>>>>>>> not suitable for this kind of error propagating. And it would be very hard >>>>>>>> to implement such kind of semantic in some transports. Virtqueue should be >>>>>>>> much better. As Yong Ji quoted, the config space is used for >>>>>>>> "rarely-changing or intialization-time parameters". >>>>>>>> >>>>>>>> >>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to >>>>>>>>> handle the message failure, I'm going to add a return value to >>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can >>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device >>>>>>>>> driver can be modified to handle that. >>>>>>>>> >>>>>>>>> Jason and Stefan, what do you think of this way? >>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value? >>>>>>> >>>>>>> The VIRTIO spec provides no way for the device to report errors from >>>>>>> config space accesses. >>>>>>> >>>>>>> The QEMU virtio-pci implementation returns -1 from invalid >>>>>>> virtio_config_read*() and silently discards virtio_config_write*() >>>>>>> accesses. >>>>>>> >>>>>>> VDUSE can take the same approach with >>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. >>>>>>> >>>>>>>> I'd like to stick to the current assumption thich get_config won't fail. >>>>>>>> That is to say, >>>>>>>> >>>>>>>> 1) maintain a config in the kernel, make sure the config space read can >>>>>>>> always succeed >>>>>>>> 2) introduce an ioctl for the vduse usersapce to update the config space. >>>>>>>> 3) we can synchronize with the vduse userspace during set_config >>>>>>>> >>>>>>>> Does this work? >>>>>>> I noticed that caching is also allowed by the vhost-user protocol >>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't >>>>>>> know whether or not caching is in effect. The interface you outlined >>>>>>> above requires caching. >>>>>>> >>>>>>> Is there a reason why the host kernel vDPA code needs to cache the >>>>>>> configuration space? >>>>>> Because: >>>>>> >>>>>> 1) Kernel can not wait forever in get_config(), this is the major difference >>>>>> with vhost-user. >>>>> virtio_cread() can sleep: >>>>> >>>>> #define virtio_cread(vdev, structname, member, ptr) \ >>>>> do { \ >>>>> typeof(((structname*)0)->member) virtio_cread_v; \ >>>>> \ >>>>> might_sleep(); \ >>>>> ^^^^^^^^^^^^^^ >>>>> >>>>> Which code path cannot sleep? >>>> Well, it can sleep but it can't sleep forever. For VDUSE, a >>>> buggy/malicious userspace may refuse to respond to the get_config. >>>> >>>> It looks to me the ideal case, with the current virtio spec, for VDUSE is to >>>> >>>> 1) maintain the device and its state in the kernel, userspace may sync >>>> with the kernel device via ioctls >>>> 2) offload the datapath (virtqueue) to the userspace >>>> >>>> This seems more robust and safe than simply relaying everything to >>>> userspace and waiting for its response. >>>> >>>> And we know for sure this model can work, an example is TUN/TAP: >>>> netdevice is abstracted in the kernel and datapath is done via >>>> sendmsg()/recvmsg(). >>>> >>>> Maintaining the config in the kernel follows this model and it can >>>> simplify the device generation implementation. >>>> >>>> For config space write, it requires more thought but fortunately it's >>>> not commonly used. So VDUSE can choose to filter out the >>>> device/features that depends on the config write. >>> This is the problem. There are other messages like SET_FEATURES where I >>> guess we'll face the same challenge. >> >> Probably not, userspace device can tell the kernel about the device_features >> and mandated_features during creation, and the feature negotiation could be >> done purely in the kernel without bothering the userspace. (For some reason I drop the list accidentally, adding them back, sorry) > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous > interface where the driver waits for the device. It depends on how we define "synchronous" here. If I understand correctly, the spec doesn't expect there will be any kind of failure for the operation of set_status itself. Instead, anytime it want any synchronization, it should be done via get_status(): 1) re-read device status to make sure FEATURES_OK is set during feature negotiation 2) re-read device status to be 0 to make sure the device has finish the reset > > VDUSE currently doesn't wait for the device emulation process to handle > this message (no reply is needed) but I think this is a mistake because > VDUSE is not following the VIRTIO device model. With the trick that is done for FEATURES_OK above, I think we don't need to wait for the reply. If userspace takes too long to respond, it can be detected since get_status() doesn't return the expected value for long time. And for the case that needs a timeout, we probably can use NEEDS_RESET. > > I strongly suggest designing the VDUSE interface to match the VIRTIO > device model (or at least the vDPA interface). I fully agree with you and that is what we want to achieve in this series. > Defining a custom > interface for VDUSE avoids some implementation complexity and makes it > easier to deal with untrusted userspace, but it's impossible to > implement certain VIRTIO features or devices. It also fragments VIRTIO > more than necessary; we have a standard, let's stick to it. Yes. > >>> I agree that caching the contents of configuration space in the kernel >>> helps, but if there are other VDUSE messages with the same problem then >>> an attacker will exploit them instead. >>> >>> I think a systematic solution is needed. It would be necessary to >>> enumerate the virtio_vdpa and vhost_vdpa cases separately to figure out >>> where VDUSE messages are synchronous/time-sensitive. >> >> This is the case of reset and needs more thought. We should stick a >> consistent uAPI for the userspace. >> >> For vhost-vDPA, it needs synchronzied with the userspace and we can wait for >> ever. > The VMM should still be able to handle signals when a vhost_vdpa ioctl > is waiting for a reply from the VDUSE userspace process. Or if that's > not possible then there needs to be a way to force disconnection from > VDUSE so the VMM can be killed. Note that VDUSE works under vDPA bus, so vhost should be transport to VDUSE. But we can detect this via whether or not the bounce buffer is used. Thanks > > Stefan
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote: > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道: > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote: > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道: > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote: > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote: > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > > > > > > > configuration space is generally used for rarely-changing or > > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > > > > > > > ioctl should not be called frequently. > > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > > > > > > > Here the language (especially the word "generally") is weaker and means > > > > > > > > > > > there may be exceptions. > > > > > > > > > > > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > > > > > > > containing an error code if the device encounters a problem unrelated to > > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > > > > > > > code to 0, saving the driver an extra configuration space write access > > > > > > > > > > > and possibly race conditions. It isn't possible to implement those > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > > > > Note that though you're correct, my understanding is that config space is > > > > > > > > > not suitable for this kind of error propagating. And it would be very hard > > > > > > > > > to implement such kind of semantic in some transports. Virtqueue should be > > > > > > > > > much better. As Yong Ji quoted, the config space is used for > > > > > > > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > > > > > > > > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > > > > > > > handle the message failure, I'm going to add a return value to > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > > > > > > > driver can be modified to handle that. > > > > > > > > > > > > > > > > > > > > Jason and Stefan, what do you think of this way? > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > > > > > > > > > > > > > The VIRTIO spec provides no way for the device to report errors from > > > > > > > > config space accesses. > > > > > > > > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid > > > > > > > > virtio_config_read*() and silently discards virtio_config_write*() > > > > > > > > accesses. > > > > > > > > > > > > > > > > VDUSE can take the same approach with > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > > > > > > > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail. > > > > > > > > > That is to say, > > > > > > > > > > > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can > > > > > > > > > always succeed > > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space. > > > > > > > > > 3) we can synchronize with the vduse userspace during set_config > > > > > > > > > > > > > > > > > > Does this work? > > > > > > > > I noticed that caching is also allowed by the vhost-user protocol > > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't > > > > > > > > know whether or not caching is in effect. The interface you outlined > > > > > > > > above requires caching. > > > > > > > > > > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the > > > > > > > > configuration space? > > > > > > > Because: > > > > > > > > > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference > > > > > > > with vhost-user. > > > > > > virtio_cread() can sleep: > > > > > > > > > > > > #define virtio_cread(vdev, structname, member, ptr) \ > > > > > > do { \ > > > > > > typeof(((structname*)0)->member) virtio_cread_v; \ > > > > > > \ > > > > > > might_sleep(); \ > > > > > > ^^^^^^^^^^^^^^ > > > > > > > > > > > > Which code path cannot sleep? > > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a > > > > > buggy/malicious userspace may refuse to respond to the get_config. > > > > > > > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to > > > > > > > > > > 1) maintain the device and its state in the kernel, userspace may sync > > > > > with the kernel device via ioctls > > > > > 2) offload the datapath (virtqueue) to the userspace > > > > > > > > > > This seems more robust and safe than simply relaying everything to > > > > > userspace and waiting for its response. > > > > > > > > > > And we know for sure this model can work, an example is TUN/TAP: > > > > > netdevice is abstracted in the kernel and datapath is done via > > > > > sendmsg()/recvmsg(). > > > > > > > > > > Maintaining the config in the kernel follows this model and it can > > > > > simplify the device generation implementation. > > > > > > > > > > For config space write, it requires more thought but fortunately it's > > > > > not commonly used. So VDUSE can choose to filter out the > > > > > device/features that depends on the config write. > > > > This is the problem. There are other messages like SET_FEATURES where I > > > > guess we'll face the same challenge. > > > > > > Probably not, userspace device can tell the kernel about the device_features > > > and mandated_features during creation, and the feature negotiation could be > > > done purely in the kernel without bothering the userspace. > > > (For some reason I drop the list accidentally, adding them back, sorry) > > > > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous > > interface where the driver waits for the device. > > > It depends on how we define "synchronous" here. If I understand correctly, > the spec doesn't expect there will be any kind of failure for the operation > of set_status itself. > > Instead, anytime it want any synchronization, it should be done via > get_status(): > > 1) re-read device status to make sure FEATURES_OK is set during feature > negotiation > 2) re-read device status to be 0 to make sure the device has finish the > reset > > > > > > VDUSE currently doesn't wait for the device emulation process to handle > > this message (no reply is needed) but I think this is a mistake because > > VDUSE is not following the VIRTIO device model. > > > With the trick that is done for FEATURES_OK above, I think we don't need to > wait for the reply. > > If userspace takes too long to respond, it can be detected since > get_status() doesn't return the expected value for long time. > > And for the case that needs a timeout, we probably can use NEEDS_RESET. I think you're right. get_status is the synchronization point, not set_status. Currently there is no VDUSE GET_STATUS message. The VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so that the device emulation program can participate in emulating the Device Status field. This change could affect VDUSE's VIRTIO feature interface since the device emulation program can reject features by not setting FEATURES_OK. Stefan
在 2021/7/7 下午11:54, Stefan Hajnoczi 写道: > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote: >> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道: >>> On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote: >>>> 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道: >>>>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote: >>>>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote: >>>>>>>> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: >>>>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: >>>>>>>>>> 在 2021/7/4 下午5:49, Yongji Xie 写道: >>>>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device >>>>>>>>>>>>> configuration space is generally used for rarely-changing or >>>>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG >>>>>>>>>>>>> ioctl should not be called frequently. >>>>>>>>>>>> The spec uses MUST and other terms to define the precise requirements. >>>>>>>>>>>> Here the language (especially the word "generally") is weaker and means >>>>>>>>>>>> there may be exceptions. >>>>>>>>>>>> >>>>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG >>>>>>>>>>>> approach is reads that have side-effects. For example, imagine a field >>>>>>>>>>>> containing an error code if the device encounters a problem unrelated to >>>>>>>>>>>> a specific virtqueue request. Reading from this field resets the error >>>>>>>>>>>> code to 0, saving the driver an extra configuration space write access >>>>>>>>>>>> and possibly race conditions. It isn't possible to implement those >>>>>>>>>>>> semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it >>>>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics. >>>>>>>>>> Note that though you're correct, my understanding is that config space is >>>>>>>>>> not suitable for this kind of error propagating. And it would be very hard >>>>>>>>>> to implement such kind of semantic in some transports. Virtqueue should be >>>>>>>>>> much better. As Yong Ji quoted, the config space is used for >>>>>>>>>> "rarely-changing or intialization-time parameters". >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to >>>>>>>>>>> handle the message failure, I'm going to add a return value to >>>>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can >>>>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device >>>>>>>>>>> driver can be modified to handle that. >>>>>>>>>>> >>>>>>>>>>> Jason and Stefan, what do you think of this way? >>>>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value? >>>>>>>>> >>>>>>>>> The VIRTIO spec provides no way for the device to report errors from >>>>>>>>> config space accesses. >>>>>>>>> >>>>>>>>> The QEMU virtio-pci implementation returns -1 from invalid >>>>>>>>> virtio_config_read*() and silently discards virtio_config_write*() >>>>>>>>> accesses. >>>>>>>>> >>>>>>>>> VDUSE can take the same approach with >>>>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. >>>>>>>>> >>>>>>>>>> I'd like to stick to the current assumption thich get_config won't fail. >>>>>>>>>> That is to say, >>>>>>>>>> >>>>>>>>>> 1) maintain a config in the kernel, make sure the config space read can >>>>>>>>>> always succeed >>>>>>>>>> 2) introduce an ioctl for the vduse usersapce to update the config space. >>>>>>>>>> 3) we can synchronize with the vduse userspace during set_config >>>>>>>>>> >>>>>>>>>> Does this work? >>>>>>>>> I noticed that caching is also allowed by the vhost-user protocol >>>>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't >>>>>>>>> know whether or not caching is in effect. The interface you outlined >>>>>>>>> above requires caching. >>>>>>>>> >>>>>>>>> Is there a reason why the host kernel vDPA code needs to cache the >>>>>>>>> configuration space? >>>>>>>> Because: >>>>>>>> >>>>>>>> 1) Kernel can not wait forever in get_config(), this is the major difference >>>>>>>> with vhost-user. >>>>>>> virtio_cread() can sleep: >>>>>>> >>>>>>> #define virtio_cread(vdev, structname, member, ptr) \ >>>>>>> do { \ >>>>>>> typeof(((structname*)0)->member) virtio_cread_v; \ >>>>>>> \ >>>>>>> might_sleep(); \ >>>>>>> ^^^^^^^^^^^^^^ >>>>>>> >>>>>>> Which code path cannot sleep? >>>>>> Well, it can sleep but it can't sleep forever. For VDUSE, a >>>>>> buggy/malicious userspace may refuse to respond to the get_config. >>>>>> >>>>>> It looks to me the ideal case, with the current virtio spec, for VDUSE is to >>>>>> >>>>>> 1) maintain the device and its state in the kernel, userspace may sync >>>>>> with the kernel device via ioctls >>>>>> 2) offload the datapath (virtqueue) to the userspace >>>>>> >>>>>> This seems more robust and safe than simply relaying everything to >>>>>> userspace and waiting for its response. >>>>>> >>>>>> And we know for sure this model can work, an example is TUN/TAP: >>>>>> netdevice is abstracted in the kernel and datapath is done via >>>>>> sendmsg()/recvmsg(). >>>>>> >>>>>> Maintaining the config in the kernel follows this model and it can >>>>>> simplify the device generation implementation. >>>>>> >>>>>> For config space write, it requires more thought but fortunately it's >>>>>> not commonly used. So VDUSE can choose to filter out the >>>>>> device/features that depends on the config write. >>>>> This is the problem. There are other messages like SET_FEATURES where I >>>>> guess we'll face the same challenge. >>>> Probably not, userspace device can tell the kernel about the device_features >>>> and mandated_features during creation, and the feature negotiation could be >>>> done purely in the kernel without bothering the userspace. >> >> (For some reason I drop the list accidentally, adding them back, sorry) >> >> >>> Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous >>> interface where the driver waits for the device. >> >> It depends on how we define "synchronous" here. If I understand correctly, >> the spec doesn't expect there will be any kind of failure for the operation >> of set_status itself. >> >> Instead, anytime it want any synchronization, it should be done via >> get_status(): >> >> 1) re-read device status to make sure FEATURES_OK is set during feature >> negotiation >> 2) re-read device status to be 0 to make sure the device has finish the >> reset >> >> >>> VDUSE currently doesn't wait for the device emulation process to handle >>> this message (no reply is needed) but I think this is a mistake because >>> VDUSE is not following the VIRTIO device model. >> >> With the trick that is done for FEATURES_OK above, I think we don't need to >> wait for the reply. >> >> If userspace takes too long to respond, it can be detected since >> get_status() doesn't return the expected value for long time. >> >> And for the case that needs a timeout, we probably can use NEEDS_RESET. > I think you're right. get_status is the synchronization point, not > set_status. > > Currently there is no VDUSE GET_STATUS message. The > VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so > that the device emulation program can participate in emulating the > Device Status field. I'm not sure I get this, but it is what has been done? +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status) +{ + struct vduse_dev *dev = vdpa_to_vduse(vdpa); + bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK); + + dev->status = status; + + if (dev->started == started) + return; + + dev->started = started; + if (dev->started) { + vduse_dev_start_dataplane(dev); + } else { + vduse_dev_reset(dev); + vduse_dev_stop_dataplane(dev); + } +} But the looks not correct: 1) !DRIVER_OK doesn't means a reset? 2) Need to deal with FEATURES_OK Thanks > This change could affect VDUSE's VIRTIO feature > interface since the device emulation program can reject features by not > setting FEATURES_OK. > > Stefan
On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote: > > 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道: > > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote: > > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道: > > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote: > > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道: > > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote: > > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote: > > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: > > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > > > > > > > > > configuration space is generally used for rarely-changing or > > > > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > > > > > > > > > ioctl should not be called frequently. > > > > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > > > > > > > > > Here the language (especially the word "generally") is weaker and means > > > > > > > > > > > > > there may be exceptions. > > > > > > > > > > > > > > > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > > > > > > > > > containing an error code if the device encounters a problem unrelated to > > > > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > > > > > > > > > code to 0, saving the driver an extra configuration space write access > > > > > > > > > > > > > and possibly race conditions. It isn't possible to implement those > > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > > > > > > Note that though you're correct, my understanding is that config space is > > > > > > > > > > > not suitable for this kind of error propagating. And it would be very hard > > > > > > > > > > > to implement such kind of semantic in some transports. Virtqueue should be > > > > > > > > > > > much better. As Yong Ji quoted, the config space is used for > > > > > > > > > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > > > > > > > > > handle the message failure, I'm going to add a return value to > > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > > > > > > > > > driver can be modified to handle that. > > > > > > > > > > > > > > > > > > > > > > > > Jason and Stefan, what do you think of this way? > > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > > > > > > > > > > > > > > > > > The VIRTIO spec provides no way for the device to report errors from > > > > > > > > > > config space accesses. > > > > > > > > > > > > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid > > > > > > > > > > virtio_config_read*() and silently discards virtio_config_write*() > > > > > > > > > > accesses. > > > > > > > > > > > > > > > > > > > > VDUSE can take the same approach with > > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > > > > > > > > > > > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail. > > > > > > > > > > > That is to say, > > > > > > > > > > > > > > > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can > > > > > > > > > > > always succeed > > > > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space. > > > > > > > > > > > 3) we can synchronize with the vduse userspace during set_config > > > > > > > > > > > > > > > > > > > > > > Does this work? > > > > > > > > > > I noticed that caching is also allowed by the vhost-user protocol > > > > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't > > > > > > > > > > know whether or not caching is in effect. The interface you outlined > > > > > > > > > > above requires caching. > > > > > > > > > > > > > > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the > > > > > > > > > > configuration space? > > > > > > > > > Because: > > > > > > > > > > > > > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference > > > > > > > > > with vhost-user. > > > > > > > > virtio_cread() can sleep: > > > > > > > > > > > > > > > > #define virtio_cread(vdev, structname, member, ptr) \ > > > > > > > > do { \ > > > > > > > > typeof(((structname*)0)->member) virtio_cread_v; \ > > > > > > > > \ > > > > > > > > might_sleep(); \ > > > > > > > > ^^^^^^^^^^^^^^ > > > > > > > > > > > > > > > > Which code path cannot sleep? > > > > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a > > > > > > > buggy/malicious userspace may refuse to respond to the get_config. > > > > > > > > > > > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to > > > > > > > > > > > > > > 1) maintain the device and its state in the kernel, userspace may sync > > > > > > > with the kernel device via ioctls > > > > > > > 2) offload the datapath (virtqueue) to the userspace > > > > > > > > > > > > > > This seems more robust and safe than simply relaying everything to > > > > > > > userspace and waiting for its response. > > > > > > > > > > > > > > And we know for sure this model can work, an example is TUN/TAP: > > > > > > > netdevice is abstracted in the kernel and datapath is done via > > > > > > > sendmsg()/recvmsg(). > > > > > > > > > > > > > > Maintaining the config in the kernel follows this model and it can > > > > > > > simplify the device generation implementation. > > > > > > > > > > > > > > For config space write, it requires more thought but fortunately it's > > > > > > > not commonly used. So VDUSE can choose to filter out the > > > > > > > device/features that depends on the config write. > > > > > > This is the problem. There are other messages like SET_FEATURES where I > > > > > > guess we'll face the same challenge. > > > > > Probably not, userspace device can tell the kernel about the device_features > > > > > and mandated_features during creation, and the feature negotiation could be > > > > > done purely in the kernel without bothering the userspace. > > > > > > (For some reason I drop the list accidentally, adding them back, sorry) > > > > > > > > > > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous > > > > interface where the driver waits for the device. > > > > > > It depends on how we define "synchronous" here. If I understand correctly, > > > the spec doesn't expect there will be any kind of failure for the operation > > > of set_status itself. > > > > > > Instead, anytime it want any synchronization, it should be done via > > > get_status(): > > > > > > 1) re-read device status to make sure FEATURES_OK is set during feature > > > negotiation > > > 2) re-read device status to be 0 to make sure the device has finish the > > > reset > > > > > > > > > > VDUSE currently doesn't wait for the device emulation process to handle > > > > this message (no reply is needed) but I think this is a mistake because > > > > VDUSE is not following the VIRTIO device model. > > > > > > With the trick that is done for FEATURES_OK above, I think we don't need to > > > wait for the reply. > > > > > > If userspace takes too long to respond, it can be detected since > > > get_status() doesn't return the expected value for long time. > > > > > > And for the case that needs a timeout, we probably can use NEEDS_RESET. > > I think you're right. get_status is the synchronization point, not > > set_status. > > > > Currently there is no VDUSE GET_STATUS message. The > > VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so > > that the device emulation program can participate in emulating the > > Device Status field. > > > I'm not sure I get this, but it is what has been done? > > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status) > +{ > + struct vduse_dev *dev = vdpa_to_vduse(vdpa); > + bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK); > + > + dev->status = status; > + > + if (dev->started == started) > + return; > + > + dev->started = started; > + if (dev->started) { > + vduse_dev_start_dataplane(dev); > + } else { > + vduse_dev_reset(dev); > + vduse_dev_stop_dataplane(dev); > + } > +} > > > But the looks not correct: > > 1) !DRIVER_OK doesn't means a reset? > 2) Need to deal with FEATURES_OK I'm not sure if this reply was to me or to Yongji Xie? Currently vduse_vdpa_set_status() does not allow the device emulation program to participate fully in Device Status field changes. It hides the status bits and only sends VDUSE_START/STOP_DATAPLANE. I suggest having GET_STATUS/SET_STATUS messages instead, allowing the device emulation program to handle these parts of the VIRTIO device model (e.g. rejecting combinations of features that are mutually exclusive). Stefan
On Wed, Jul 07, 2021 at 05:09:13PM +0800, Yongji Xie wrote: > On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote: > > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > > > configuration space is generally used for rarely-changing or > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > > > ioctl should not be called frequently. > > > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > > > Here the language (especially the word "generally") is weaker and means > > > > > > > there may be exceptions. > > > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > > > containing an error code if the device encounters a problem unrelated to > > > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > > > code to 0, saving the driver an extra configuration space write access > > > > > > > and possibly race conditions. It isn't possible to implement those > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > > > > > > > > > > Note that though you're correct, my understanding is that config space is > > > > > not suitable for this kind of error propagating. And it would be very hard > > > > > to implement such kind of semantic in some transports. Virtqueue should be > > > > > much better. As Yong Ji quoted, the config space is used for > > > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > > > handle the message failure, I'm going to add a return value to > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > > > driver can be modified to handle that. > > > > > > > > > > > > Jason and Stefan, what do you think of this way? > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > > > > > > > We add a timeout and return error in case userspace never replies to > > > the message. > > > > > > > The VIRTIO spec provides no way for the device to report errors from > > > > config space accesses. > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid > > > > virtio_config_read*() and silently discards virtio_config_write*() > > > > accesses. > > > > > > > > VDUSE can take the same approach with > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > > > > > > > I noticed that virtio_config_read*() only returns -1 when we access a > > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail > > > when we access a valid field. Not sure if it's ok to silently ignore > > > this kind of error. > > > > That's a good point but it's a general VIRTIO issue. Any device > > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the > > VIRTIO specification needs to provide a way for the driver to detect > > this. > > > > If userspace violates the contract then VDUSE needs to mark the device > > broken. QEMU's device emulation does something similar with the > > vdev->broken flag. > > > > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by > > vDPA/VDUSE to indicate that the device is not operational and must be > > reset. > > > > It might be a solution. But DEVICE_NEEDS_RESET is not implemented > currently. So I'm thinking whether it's ok to add a check of > DEVICE_NEEDS_RESET status bit in probe function of virtio device > driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail > device initailization when configuration space access failed. Okay. Stefan
On Thu, Jul 8, 2021 at 5:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote: > > > > 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道: > > > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote: > > > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道: > > > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote: > > > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道: > > > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote: > > > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote: > > > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道: > > > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote: > > > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道: > > > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device > > > > > > > > > > > > > > > configuration space is generally used for rarely-changing or > > > > > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG > > > > > > > > > > > > > > > ioctl should not be called frequently. > > > > > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements. > > > > > > > > > > > > > > Here the language (especially the word "generally") is weaker and means > > > > > > > > > > > > > > there may be exceptions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG > > > > > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field > > > > > > > > > > > > > > containing an error code if the device encounters a problem unrelated to > > > > > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error > > > > > > > > > > > > > > code to 0, saving the driver an extra configuration space write access > > > > > > > > > > > > > > and possibly race conditions. It isn't possible to implement those > > > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it > > > > > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics. > > > > > > > > > > > > Note that though you're correct, my understanding is that config space is > > > > > > > > > > > > not suitable for this kind of error propagating. And it would be very hard > > > > > > > > > > > > to implement such kind of semantic in some transports. Virtqueue should be > > > > > > > > > > > > much better. As Yong Ji quoted, the config space is used for > > > > > > > > > > > > "rarely-changing or intialization-time parameters". > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to > > > > > > > > > > > > > handle the message failure, I'm going to add a return value to > > > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can > > > > > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device > > > > > > > > > > > > > driver can be modified to handle that. > > > > > > > > > > > > > > > > > > > > > > > > > > Jason and Stefan, what do you think of this way? > > > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value? > > > > > > > > > > > > > > > > > > > > > > The VIRTIO spec provides no way for the device to report errors from > > > > > > > > > > > config space accesses. > > > > > > > > > > > > > > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid > > > > > > > > > > > virtio_config_read*() and silently discards virtio_config_write*() > > > > > > > > > > > accesses. > > > > > > > > > > > > > > > > > > > > > > VDUSE can take the same approach with > > > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG. > > > > > > > > > > > > > > > > > > > > > > > I'd like to stick to the current assumption thich get_config won't fail. > > > > > > > > > > > > That is to say, > > > > > > > > > > > > > > > > > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can > > > > > > > > > > > > always succeed > > > > > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the config space. > > > > > > > > > > > > 3) we can synchronize with the vduse userspace during set_config > > > > > > > > > > > > > > > > > > > > > > > > Does this work? > > > > > > > > > > > I noticed that caching is also allowed by the vhost-user protocol > > > > > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't > > > > > > > > > > > know whether or not caching is in effect. The interface you outlined > > > > > > > > > > > above requires caching. > > > > > > > > > > > > > > > > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the > > > > > > > > > > > configuration space? > > > > > > > > > > Because: > > > > > > > > > > > > > > > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference > > > > > > > > > > with vhost-user. > > > > > > > > > virtio_cread() can sleep: > > > > > > > > > > > > > > > > > > #define virtio_cread(vdev, structname, member, ptr) \ > > > > > > > > > do { \ > > > > > > > > > typeof(((structname*)0)->member) virtio_cread_v; \ > > > > > > > > > \ > > > > > > > > > might_sleep(); \ > > > > > > > > > ^^^^^^^^^^^^^^ > > > > > > > > > > > > > > > > > > Which code path cannot sleep? > > > > > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a > > > > > > > > buggy/malicious userspace may refuse to respond to the get_config. > > > > > > > > > > > > > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to > > > > > > > > > > > > > > > > 1) maintain the device and its state in the kernel, userspace may sync > > > > > > > > with the kernel device via ioctls > > > > > > > > 2) offload the datapath (virtqueue) to the userspace > > > > > > > > > > > > > > > > This seems more robust and safe than simply relaying everything to > > > > > > > > userspace and waiting for its response. > > > > > > > > > > > > > > > > And we know for sure this model can work, an example is TUN/TAP: > > > > > > > > netdevice is abstracted in the kernel and datapath is done via > > > > > > > > sendmsg()/recvmsg(). > > > > > > > > > > > > > > > > Maintaining the config in the kernel follows this model and it can > > > > > > > > simplify the device generation implementation. > > > > > > > > > > > > > > > > For config space write, it requires more thought but fortunately it's > > > > > > > > not commonly used. So VDUSE can choose to filter out the > > > > > > > > device/features that depends on the config write. > > > > > > > This is the problem. There are other messages like SET_FEATURES where I > > > > > > > guess we'll face the same challenge. > > > > > > Probably not, userspace device can tell the kernel about the device_features > > > > > > and mandated_features during creation, and the feature negotiation could be > > > > > > done purely in the kernel without bothering the userspace. > > > > > > > > (For some reason I drop the list accidentally, adding them back, sorry) > > > > > > > > > > > > > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous > > > > > interface where the driver waits for the device. > > > > > > > > It depends on how we define "synchronous" here. If I understand correctly, > > > > the spec doesn't expect there will be any kind of failure for the operation > > > > of set_status itself. > > > > > > > > Instead, anytime it want any synchronization, it should be done via > > > > get_status(): > > > > > > > > 1) re-read device status to make sure FEATURES_OK is set during feature > > > > negotiation > > > > 2) re-read device status to be 0 to make sure the device has finish the > > > > reset > > > > > > > > > > > > > VDUSE currently doesn't wait for the device emulation process to handle > > > > > this message (no reply is needed) but I think this is a mistake because > > > > > VDUSE is not following the VIRTIO device model. > > > > > > > > With the trick that is done for FEATURES_OK above, I think we don't need to > > > > wait for the reply. > > > > > > > > If userspace takes too long to respond, it can be detected since > > > > get_status() doesn't return the expected value for long time. > > > > > > > > And for the case that needs a timeout, we probably can use NEEDS_RESET. > > > I think you're right. get_status is the synchronization point, not > > > set_status. > > > > > > Currently there is no VDUSE GET_STATUS message. The > > > VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so > > > that the device emulation program can participate in emulating the > > > Device Status field. > > > > > > I'm not sure I get this, but it is what has been done? > > > > +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status) > > +{ > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa); > > + bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK); > > + > > + dev->status = status; > > + > > + if (dev->started == started) > > + return; > > + > > + dev->started = started; > > + if (dev->started) { > > + vduse_dev_start_dataplane(dev); > > + } else { > > + vduse_dev_reset(dev); > > + vduse_dev_stop_dataplane(dev); > > + } > > +} > > > > > > But the looks not correct: > > > > 1) !DRIVER_OK doesn't means a reset? > > 2) Need to deal with FEATURES_OK > > I'm not sure if this reply was to me or to Yongji Xie? > > Currently vduse_vdpa_set_status() does not allow the device emulation > program to participate fully in Device Status field changes. It hides > the status bits and only sends VDUSE_START/STOP_DATAPLANE. > > I suggest having GET_STATUS/SET_STATUS messages instead, allowing the > device emulation program to handle these parts of the VIRTIO device > model (e.g. rejecting combinations of features that are mutually > exclusive). > Yes, I will do this in the next version. Thanks, Yongi
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 0b5eefed027e..c432be070f67 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -27,6 +27,7 @@ place where this information is gathered. iommu media/index sysfs-platform_profile + vduse .. only:: subproject and html diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst new file mode 100644 index 000000000000..2f9cd1a4e530 --- /dev/null +++ b/Documentation/userspace-api/vduse.rst @@ -0,0 +1,222 @@ +================================== +VDUSE - "vDPA Device in Userspace" +================================== + +vDPA (virtio data path acceleration) device is a device that uses a +datapath which complies with the virtio specifications with vendor +specific control path. vDPA devices can be both physically located on +the hardware or emulated by software. VDUSE is a framework that makes it +possible to implement software-emulated vDPA devices in userspace. And +to make it simple, the emulated vDPA device's control path is handled in +the kernel and only the data path is implemented in the userspace. + +Note that only virtio block device is supported by VDUSE framework now, +which can reduce security risks when the userspace process that implements +the data path is run by an unprivileged user. The Support for other device +types can be added after the security issue is clarified or fixed in the future. + +Start/Stop VDUSE devices +------------------------ + +VDUSE devices are started as follows: + +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on + /dev/vduse/control. + +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first + messages will arrive while attaching the VDUSE instance to vDPA bus. + +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE + instance to vDPA bus. + +VDUSE devices are stopped as follows: + +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE + instance from vDPA bus. + +2. Close the file descriptor referring to /dev/vduse/$NAME + +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on + /dev/vduse/control + +The netlink messages metioned above can be sent via vdpa tool in iproute2 +or use the below sample codes: + +.. code-block:: c + + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) + { + struct nl_sock *nlsock; + struct nl_msg *msg; + int famid; + + nlsock = nl_socket_alloc(); + if (!nlsock) + return -ENOMEM; + + if (genl_connect(nlsock)) + goto free_sock; + + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); + if (famid < 0) + goto close_sock; + + msg = nlmsg_alloc(); + if (!msg) + goto close_sock; + + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0)) + goto nla_put_failure; + + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); + if (cmd == VDPA_CMD_DEV_NEW) + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse"); + + if (nl_send_sync(nlsock, msg)) + goto close_sock; + + nl_close(nlsock); + nl_socket_free(nlsock); + + return 0; + nla_put_failure: + nlmsg_free(msg); + close_sock: + nl_close(nlsock); + free_sock: + nl_socket_free(nlsock); + return -1; + } + +How VDUSE works +--------------- + +Since the emuldated vDPA device's control path is handled in the kernel, +a message-based communication protocol and few types of control messages +are introduced by VDUSE framework to make userspace be aware of the data +path related changes: + +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace + +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane + +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane + +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB + +Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply +those control messages from/to VDUSE kernel module as follows: + +.. code-block:: c + + static int vduse_message_handler(int dev_fd) + { + int len; + struct vduse_dev_request req; + struct vduse_dev_response resp; + + len = read(dev_fd, &req, sizeof(req)); + if (len != sizeof(req)) + return -1; + + resp.request_id = req.request_id; + + switch (req.type) { + + /* handle different types of message */ + + } + + if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY) + return 0; + + len = write(dev_fd, &resp, sizeof(resp)); + if (len != sizeof(resp)) + return -1; + + return 0; + } + +After VDUSE_START_DATAPLANE messages is received, userspace should start the +dataplane processing with the help of some ioctls on /dev/vduse/$NAME: + +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region. + Userspace can access this iova region by passing fd and corresponding size, offset, + perm to mmap(). For example: + +.. code-block:: c + + static int perm_to_prot(uint8_t perm) + { + int prot = 0; + + switch (perm) { + case VDUSE_ACCESS_WO: + prot |= PROT_WRITE; + break; + case VDUSE_ACCESS_RO: + prot |= PROT_READ; + break; + case VDUSE_ACCESS_RW: + prot |= PROT_READ | PROT_WRITE; + break; + } + + return prot; + } + + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len) + { + int fd; + void *addr; + size_t size; + struct vduse_iotlb_entry entry; + + entry.start = iova; + entry.last = iova + 1; + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry); + if (fd < 0) + return NULL; + + size = entry.last - entry.start + 1; + *len = entry.last - iova + 1; + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED, + fd, entry.offset); + close(fd); + if (addr == MAP_FAILED) + return NULL; + + /* do something to cache this iova region */ + + return addr + iova - entry.start; + } + +- VDUSE_DEV_GET_FEATURES: Get the negotiated features + +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt + +- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata + +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used + by VDUSE kernel module to notify userspace to consume the vring. + +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue + +MMU-based IOMMU Driver +---------------------- + +VDUSE framework implements an MMU-based on-chip IOMMU driver to support +mapping the kernel DMA buffer into the userspace iova region dynamically. +This is mainly designed for virtio-vdpa case (kernel virtio drivers). + +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA). +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer +so that the userspace process is able to use its virtual address to access +the DMA buffer in kernel. + +And to avoid security issue, a bounce-buffering mechanism is introduced to +prevent userspace accessing the original buffer directly which may contain other +kernel data. During the mapping, unmapping, the driver will copy the data from +the original buffer to the bounce buffer and back, depending on the direction of +the transfer. And the bounce-buffer addresses will be mapped into the user address +space instead of the original one.
VDUSE (vDPA Device in Userspace) is a framework to support implementing software-emulated vDPA devices in userspace. This document is intended to clarify the VDUSE design and usage. Signed-off-by: Xie Yongji <xieyongji@bytedance.com> --- Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++ 2 files changed, 223 insertions(+) create mode 100644 Documentation/userspace-api/vduse.rst