
[v2] vhost: introduce mdev based hardware backend

Message ID 20191022095230.2514-1-tiwei.bie@intel.com (mailing list archive)
State New, archived
Series [v2] vhost: introduce mdev based hardware backend

Commit Message

Tiwei Bie Oct. 22, 2019, 9:52 a.m. UTC
This patch introduces a mdev based hardware vhost backend.
This backend is built on top of the same abstraction used
in virtio-mdev and provides a generic vhost interface for
userspace to accelerate the virtio devices in guest.

This backend is implemented as a mdev device driver on top
of the same mdev device ops used in virtio-mdev but using
a different mdev class id, and it will register the device
as a VFIO device for userspace to use. Userspace can setup
the IOMMU with the existing VFIO container/group APIs and
then get the device fd with the device name. After getting
the device fd of this device, userspace can use vhost ioctls
to setup the backend.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
This patch depends on below series:
https://lkml.org/lkml/2019/10/17/286
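
For reference, a rough userspace sketch of the flow described in the
commit message might look like this (the VFIO group number, mdev UUID and
memory layout below are made-up illustrative values, error handling is
omitted, and the status/vring ordering follows the checks enforced by this
patch):

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <linux/vhost.h>
#include <linux/virtio_config.h>

static void setup_vhost_mdev(void *guest_mem, size_t guest_mem_size)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/12", O_RDWR);
	__u64 features;
	__u8 status;
	int dev;

	/* Set up the IOMMU with the existing VFIO container/group APIs. */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(uintptr_t)guest_mem,
		.iova  = 0,	/* ring/buffer addresses passed later are IOVAs */
		.size  = guest_mem_size,
	};
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* Get the device fd with the device (mdev UUID) name. */
	dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
		    "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001");

	/* Use vhost ioctls on the device fd to set up the backend. */
	status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
	ioctl(dev, VHOST_MDEV_SET_STATUS, &status);

	ioctl(dev, VHOST_GET_FEATURES, &features);
	ioctl(dev, VHOST_SET_FEATURES, &features);

	status |= VIRTIO_CONFIG_S_FEATURES_OK;
	ioctl(dev, VHOST_MDEV_SET_STATUS, &status);

	/*
	 * ... VHOST_SET_VRING_NUM/ADDR/BASE/KICK/CALL and
	 * VHOST_MDEV_SET_VRING_ENABLE for each virtqueue ...
	 */

	status |= VIRTIO_CONFIG_S_DRIVER_OK;
	ioctl(dev, VHOST_MDEV_SET_STATUS, &status);
}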

v1 -> v2:
- Replace _SET_STATE with _SET_STATUS (MST);
- Check status bits at each step (MST);
- Report the max ring size and max number of queues (MST);
- Add missing MODULE_DEVICE_TABLE (Jason);
- Only support the network backend w/o multiqueue for now;
- Some minor fixes and improvements;
- Rebase on top of virtio-mdev series v4;

RFC v4 -> v1:
- Implement vhost-mdev as a mdev device driver directly and
  connect it to VFIO container/group. (Jason);
- Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
  meaningless HVA->GPA translations (Jason);

RFC v3 -> RFC v4:
- Build vhost-mdev on top of the same abstraction used by
  virtio-mdev (Jason);
- Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);

RFC v2 -> RFC v3:
- Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
  based vhost protocol on top of vfio-mdev (Jason);

RFC v1 -> RFC v2:
- Introduce a new VFIO device type to build a vhost protocol
  on top of vfio-mdev;

 drivers/vfio/mdev/mdev_core.c |  12 +
 drivers/vhost/Kconfig         |   9 +
 drivers/vhost/Makefile        |   3 +
 drivers/vhost/mdev.c          | 415 ++++++++++++++++++++++++++++++++++
 include/linux/mdev.h          |   3 +
 include/uapi/linux/vhost.h    |  13 ++
 6 files changed, 455 insertions(+)
 create mode 100644 drivers/vhost/mdev.c

Comments

Jason Wang Oct. 22, 2019, 1:30 p.m. UTC | #1
On 2019/10/22 5:52 PM, Tiwei Bie wrote:
> This patch introduces a mdev based hardware vhost backend.
> This backend is built on top of the same abstraction used
> in virtio-mdev and provides a generic vhost interface for
> userspace to accelerate the virtio devices in guest.
>
> This backend is implemented as a mdev device driver on top
> of the same mdev device ops used in virtio-mdev but using
> a different mdev class id, and it will register the device
> as a VFIO device for userspace to use. Userspace can setup
> the IOMMU with the existing VFIO container/group APIs and
> then get the device fd with the device name. After getting
> the device fd of this device, userspace can use vhost ioctls
> to setup the backend.
>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
> This patch depends on below series:
> https://lkml.org/lkml/2019/10/17/286
>
> v1 -> v2:
> - Replace _SET_STATE with _SET_STATUS (MST);
> - Check status bits at each step (MST);
> - Report the max ring size and max number of queues (MST);
> - Add missing MODULE_DEVICE_TABLE (Jason);
> - Only support the network backend w/o multiqueue for now;


Any idea on how to extend it to support devices other than net? I think 
we want a generic API or an API that could be made generic in the future.

Do we want e.g. to have a generic vhost mdev for all kinds of devices,
or to introduce e.g. vhost-net-mdev and vhost-scsi-mdev?


> - Some minor fixes and improvements;
> - Rebase on top of virtio-mdev series v4;
>
> RFC v4 -> v1:
> - Implement vhost-mdev as a mdev device driver directly and
>    connect it to VFIO container/group. (Jason);
> - Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
>    meaningless HVA->GPA translations (Jason);
>
> RFC v3 -> RFC v4:
> - Build vhost-mdev on top of the same abstraction used by
>    virtio-mdev (Jason);
> - Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);
>
> RFC v2 -> RFC v3:
> - Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
>    based vhost protocol on top of vfio-mdev (Jason);
>
> RFC v1 -> RFC v2:
> - Introduce a new VFIO device type to build a vhost protocol
>    on top of vfio-mdev;
>
>   drivers/vfio/mdev/mdev_core.c |  12 +
>   drivers/vhost/Kconfig         |   9 +
>   drivers/vhost/Makefile        |   3 +
>   drivers/vhost/mdev.c          | 415 ++++++++++++++++++++++++++++++++++
>   include/linux/mdev.h          |   3 +
>   include/uapi/linux/vhost.h    |  13 ++
>   6 files changed, 455 insertions(+)
>   create mode 100644 drivers/vhost/mdev.c
>
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 5834f6b7c7a5..2963f65e6648 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -69,6 +69,18 @@ void mdev_set_virtio_ops(struct mdev_device *mdev,
>   }
>   EXPORT_SYMBOL(mdev_set_virtio_ops);
>   
> +/* Specify the vhost device ops for the mdev device, this
> + * must be called during create() callback for vhost mdev device.
> + */
> +void mdev_set_vhost_ops(struct mdev_device *mdev,
> +			const struct virtio_mdev_device_ops *vhost_ops)
> +{
> +	WARN_ON(mdev->class_id);
> +	mdev->class_id = MDEV_CLASS_ID_VHOST;
> +	mdev->device_ops = vhost_ops;
> +}
> +EXPORT_SYMBOL(mdev_set_vhost_ops);
> +
>   const void *mdev_get_dev_ops(struct mdev_device *mdev)
>   {
>   	return mdev->device_ops;
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index 3d03ccbd1adc..7b5c2f655af7 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -34,6 +34,15 @@ config VHOST_VSOCK
>   	To compile this driver as a module, choose M here: the module will be called
>   	vhost_vsock.
>   
> +config VHOST_MDEV
> +	tristate "Vhost driver for Mediated devices"
> +	depends on EVENTFD && VFIO && VFIO_MDEV
> +	select VHOST
> +	default n
> +	---help---
> +	Say M here to enable the vhost_mdev module for use with
> +	the mediated device based hardware vhost accelerators.
> +
>   config VHOST
>   	tristate
>   	---help---
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> index 6c6df24f770c..ad9c0f8c6d8c 100644
> --- a/drivers/vhost/Makefile
> +++ b/drivers/vhost/Makefile
> @@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
>   
>   obj-$(CONFIG_VHOST_RING) += vringh.o
>   
> +obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
> +vhost_mdev-y := mdev.o
> +
>   obj-$(CONFIG_VHOST)	+= vhost.o
> diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
> new file mode 100644
> index 000000000000..5f9cae61018c
> --- /dev/null
> +++ b/drivers/vhost/mdev.c
> @@ -0,0 +1,415 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018-2019 Intel Corporation.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mdev.h>
> +#include <linux/module.h>
> +#include <linux/vfio.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_mdev.h>
> +#include <linux/virtio_ids.h>
> +
> +#include "vhost.h"
> +
> +/* Currently, only network backend w/o multiqueue is supported. */
> +#define VHOST_MDEV_VQ_MAX	2
> +
> +struct vhost_mdev {
> +	/* The lock is to protect this structure. */
> +	struct mutex mutex;
> +	struct vhost_dev dev;
> +	struct vhost_virtqueue *vqs;
> +	int nvqs;
> +	u64 status;
> +	u64 features;
> +	u64 acked_features;
> +	bool opened;
> +	struct mdev_device *mdev;
> +};
> +
> +static void handle_vq_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> +						  poll.work);
> +	struct vhost_mdev *m = container_of(vq->dev, struct vhost_mdev, dev);
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> +
> +	ops->kick_vq(m->mdev, vq - m->vqs);
> +}
> +
> +static irqreturn_t vhost_mdev_virtqueue_cb(void *private)
> +{
> +	struct vhost_virtqueue *vq = private;
> +	struct eventfd_ctx *call_ctx = vq->call_ctx;
> +
> +	if (call_ctx)
> +		eventfd_signal(call_ctx, 1);
> +	return IRQ_HANDLED;
> +}
> +
> +static void vhost_mdev_reset(struct vhost_mdev *m)
> +{
> +	struct mdev_device *mdev = m->mdev;
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> +
> +	m->status = 0;
> +	return ops->set_status(mdev, m->status);
> +}
> +
> +static long vhost_mdev_get_status(struct vhost_mdev *m, u8 __user *statusp)
> +{
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> +	struct mdev_device *mdev = m->mdev;
> +	u8 status;
> +
> +	status = ops->get_status(mdev);
> +	m->status = status;
> +
> +	if (copy_to_user(statusp, &status, sizeof(status)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static long vhost_mdev_set_status(struct vhost_mdev *m, u8 __user *statusp)
> +{
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> +	struct mdev_device *mdev = m->mdev;
> +	u8 status;
> +
> +	if (copy_from_user(&status, statusp, sizeof(status)))
> +		return -EFAULT;
> +
> +	/*
> +	 * Userspace shouldn't remove status bits unless reset the
> +	 * status to 0.
> +	 */
> +	if (status != 0 && (m->status & ~status) != 0)
> +		return -EINVAL;


We don't cache the vq ready information, but we do cache the status and 
features here; any reason for this?


> +
> +	ops->set_status(mdev, status);
> +	m->status = ops->get_status(mdev);
> +
> +	return 0;
> +}
> +
> +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
> +{
> +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> +		return -EFAULT;


As discussed in the previous version, do we need to filter out the MQ feature here?


> +	return 0;
> +}
> +
> +static long vhost_mdev_set_features(struct vhost_mdev *m, u64 __user *featurep)
> +{
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> +	struct mdev_device *mdev = m->mdev;
> +	u64 features;
> +
> +	/*
> +	 * It's not allowed to change the features after they have
> +	 * been negotiated.
> +	 */
> +	if (m->status & VIRTIO_CONFIG_S_FEATURES_OK)
> +		return -EPERM;


-EBUSY?


> +
> +	if (copy_from_user(&features, featurep, sizeof(features)))
> +		return -EFAULT;
> +
> +	if (features & ~m->features)
> +		return -EINVAL;
> +
> +	m->acked_features = features;
> +	if (ops->set_features(mdev, m->acked_features))
> +		return -ENODEV;


-EINVAL would be better; this is more obvious for a parent that 
wants to enforce any feature.


> +
> +	return 0;
> +}
> +
> +static long vhost_mdev_get_vring_num(struct vhost_mdev *m, u16 __user *argp)
> +{
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> +	struct mdev_device *mdev = m->mdev;
> +	u16 num;
> +
> +	num = ops->get_vq_num_max(mdev);
> +
> +	if (copy_to_user(argp, &num, sizeof(num)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static long vhost_mdev_get_queue_num(struct vhost_mdev *m, u32 __user *argp)
> +{
> +	u32 nvqs = m->nvqs;
> +
> +	if (copy_to_user(argp, &nvqs, sizeof(nvqs)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static long vhost_mdev_vring_ioctl(struct vhost_mdev *m, unsigned int cmd,
> +				   void __user *argp)
> +{
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> +	struct mdev_device *mdev = m->mdev;
> +	struct virtio_mdev_callback cb;
> +	struct vhost_virtqueue *vq;
> +	struct vhost_vring_state s;
> +	u32 idx;
> +	long r;
> +
> +	r = get_user(idx, (u32 __user *)argp);
> +	if (r < 0)
> +		return r;
> +	if (idx >= m->nvqs)
> +		return -ENOBUFS;
> +
> +	/*
> +	 * It's not allowed to detect and program vqs before
> +	 * features negotiation or after enabling driver.
> +	 */
> +	if (!(m->status & VIRTIO_CONFIG_S_FEATURES_OK) ||
> +	    (m->status & VIRTIO_CONFIG_S_DRIVER_OK))
> +		return -EPERM;


So the question is: is it better to do this in the parent or not?


> +
> +	vq = &m->vqs[idx];
> +
> +	if (cmd == VHOST_MDEV_SET_VRING_ENABLE) {
> +		if (copy_from_user(&s, argp, sizeof(s)))
> +			return -EFAULT;
> +		ops->set_vq_ready(mdev, idx, s.num);
> +		return 0;
> +	}
> +
> +	/*
> +	 * It's not allowed to detect and program vqs with
> +	 * vqs enabled.
> +	 */
> +	if (ops->get_vq_ready(mdev, idx))
> +		return -EPERM;
> +
> +	if (cmd == VHOST_GET_VRING_BASE)
> +		vq->last_avail_idx = ops->get_vq_state(m->mdev, idx);
> +
> +	r = vhost_vring_ioctl(&m->dev, cmd, argp);
> +	if (r)
> +		return r;
> +
> +	switch (cmd) {
> +	case VHOST_SET_VRING_ADDR:
> +		/*
> +		 * In vhost-mdev, the ring addresses set by userspace should
> +		 * be the DMA addresses within the VFIO container/group.
> +		 */
> +		if (ops->set_vq_address(mdev, idx, (u64)vq->desc,
> +					(u64)vq->avail, (u64)vq->used))
> +			r = -ENODEV;
> +		break;
> +
> +	case VHOST_SET_VRING_BASE:
> +		if (ops->set_vq_state(mdev, idx, vq->last_avail_idx))
> +			r = -ENODEV;
> +		break;
> +
> +	case VHOST_SET_VRING_CALL:
> +		if (vq->call_ctx) {
> +			cb.callback = vhost_mdev_virtqueue_cb;
> +			cb.private = vq;
> +		} else {
> +			cb.callback = NULL;
> +			cb.private = NULL;
> +		}
> +		ops->set_vq_cb(mdev, idx, &cb);
> +		break;
> +
> +	case VHOST_SET_VRING_NUM:
> +		ops->set_vq_num(mdev, idx, vq->num);
> +		break;
> +	}
> +
> +	return r;
> +}
> +
> +static int vhost_mdev_open(void *device_data)
> +{
> +	struct vhost_mdev *m = device_data;
> +	struct vhost_dev *dev;
> +	struct vhost_virtqueue **vqs;
> +	int nvqs, i, r;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&m->mutex);
> +
> +	if (m->opened) {
> +		r = -EBUSY;
> +		goto err;
> +	}
> +
> +	nvqs = m->nvqs;
> +	vhost_mdev_reset(m);
> +
> +	memset(&m->dev, 0, sizeof(m->dev));
> +	memset(m->vqs, 0, nvqs * sizeof(struct vhost_virtqueue));
> +
> +	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs) {
> +		r = -ENOMEM;
> +		goto err;
> +	}
> +
> +	dev = &m->dev;
> +	for (i = 0; i < nvqs; i++) {
> +		vqs[i] = &m->vqs[i];
> +		vqs[i]->handle_kick = handle_vq_kick;
> +	}
> +	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
> +	m->opened = true;
> +	mutex_unlock(&m->mutex);
> +
> +	return 0;
> +
> +err:
> +	mutex_unlock(&m->mutex);
> +	module_put(THIS_MODULE);
> +	return r;
> +}
> +
> +static void vhost_mdev_release(void *device_data)
> +{
> +	struct vhost_mdev *m = device_data;
> +
> +	mutex_lock(&m->mutex);
> +	vhost_mdev_reset(m);
> +	vhost_dev_stop(&m->dev);
> +	vhost_dev_cleanup(&m->dev);
> +
> +	kfree(m->dev.vqs);
> +	m->opened = false;
> +	mutex_unlock(&m->mutex);
> +	module_put(THIS_MODULE);
> +}
> +
> +static long vhost_mdev_unlocked_ioctl(void *device_data,
> +				      unsigned int cmd, unsigned long arg)
> +{
> +	struct vhost_mdev *m = device_data;
> +	void __user *argp = (void __user *)arg;
> +	long r;
> +
> +	mutex_lock(&m->mutex);
> +
> +	switch (cmd) {
> +	case VHOST_MDEV_GET_STATUS:
> +		r = vhost_mdev_get_status(m, argp);
> +		break;
> +	case VHOST_MDEV_SET_STATUS:
> +		r = vhost_mdev_set_status(m, argp);
> +		break;
> +	case VHOST_GET_FEATURES:
> +		r = vhost_mdev_get_features(m, argp);
> +		break;
> +	case VHOST_SET_FEATURES:
> +		r = vhost_mdev_set_features(m, argp);
> +		break;
> +	case VHOST_MDEV_GET_VRING_NUM:
> +		r = vhost_mdev_get_vring_num(m, argp);
> +		break;
> +	case VHOST_MDEV_GET_QUEUE_NUM:
> +		r = vhost_mdev_get_queue_num(m, argp);
> +		break;


It's not clear to me how this API will be used by userspace. I 
think, e.g., the features without MQ imply the queue num here.


> +	default:
> +		r = vhost_dev_ioctl(&m->dev, cmd, argp);


I believe having SET_MEM_TABLE/SET_LOG_BASE/SET_LOG_FD is for future 
support of those features. If that's true, we need to add some comments on this.


> +		if (r == -ENOIOCTLCMD)
> +			r = vhost_mdev_vring_ioctl(m, cmd, argp);
> +	}
> +
> +	mutex_unlock(&m->mutex);
> +	return r;
> +}
> +
> +static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
> +	.name		= "vfio-vhost-mdev",
> +	.open		= vhost_mdev_open,
> +	.release	= vhost_mdev_release,
> +	.ioctl		= vhost_mdev_unlocked_ioctl,
> +};
> +
> +static int vhost_mdev_probe(struct device *dev)
> +{
> +	struct mdev_device *mdev = mdev_from_dev(dev);
> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> +	struct vhost_mdev *m;
> +	int nvqs, r;
> +
> +	/* Currently, only network backend is supported. */
> +	if (ops->get_device_id(mdev) != VIRTIO_ID_NET)
> +		return -ENOTSUPP;


If we decide to go the vhost-net-mdev way, we probably need 
something smarter, e.g. a vhost bus etc.


> +
> +	if (ops->get_mdev_features(mdev) != VIRTIO_MDEV_F_VERSION_1)
> +		return -ENOTSUPP;
> +
> +	m = devm_kzalloc(dev, sizeof(*m), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> +	if (!m)
> +		return -ENOMEM;
> +
> +	nvqs = VHOST_MDEV_VQ_MAX;
> +	m->nvqs = nvqs;
> +
> +	m->vqs = devm_kmalloc_array(dev, nvqs, sizeof(struct vhost_virtqueue),
> +				    GFP_KERNEL);
> +	if (!m->vqs)
> +		return -ENOMEM;


Is it better to move those allocations to open()? Otherwise the memset 
there seems strange.


> +
> +	r = vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, m);
> +	if (r)
> +		return r;
> +
> +	mutex_init(&m->mutex);
> +	m->features = ops->get_features(mdev);
> +	m->mdev = mdev;
> +	return 0;
> +}
> +
> +static void vhost_mdev_remove(struct device *dev)
> +{
> +	struct vhost_mdev *m;
> +
> +	m = vfio_del_group_dev(dev);
> +	mutex_destroy(&m->mutex);
> +}
> +
> +static const struct mdev_class_id vhost_mdev_match[] = {
> +	{ MDEV_CLASS_ID_VHOST },
> +	{ 0 },
> +};
> +MODULE_DEVICE_TABLE(mdev, vhost_mdev_match);
> +
> +static struct mdev_driver vhost_mdev_driver = {
> +	.name	= "vhost_mdev",
> +	.probe	= vhost_mdev_probe,
> +	.remove	= vhost_mdev_remove,
> +	.id_table = vhost_mdev_match,
> +};
> +
> +static int __init vhost_mdev_init(void)
> +{
> +	return mdev_register_driver(&vhost_mdev_driver, THIS_MODULE);
> +}
> +module_init(vhost_mdev_init);
> +
> +static void __exit vhost_mdev_exit(void)
> +{
> +	mdev_unregister_driver(&vhost_mdev_driver);
> +}
> +module_exit(vhost_mdev_exit);
> +
> +MODULE_VERSION("0.0.1");
> +MODULE_LICENSE("GPL v2");
> +MODULE_DESCRIPTION("Mediated device based accelerator for virtio");
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> index 13e045e09d3b..6060cdbe6d3e 100644
> --- a/include/linux/mdev.h
> +++ b/include/linux/mdev.h
> @@ -114,6 +114,8 @@ void mdev_set_vfio_ops(struct mdev_device *mdev,
>   		       const struct vfio_mdev_device_ops *vfio_ops);
>   void mdev_set_virtio_ops(struct mdev_device *mdev,
>                            const struct virtio_mdev_device_ops *virtio_ops);
> +void mdev_set_vhost_ops(struct mdev_device *mdev,
> +			const struct virtio_mdev_device_ops *vhost_ops);
>   const void *mdev_get_dev_ops(struct mdev_device *mdev);
>   
>   extern struct bus_type mdev_bus_type;
> @@ -131,6 +133,7 @@ struct mdev_device *mdev_from_dev(struct device *dev);
>   enum {
>   	MDEV_CLASS_ID_VFIO = 1,
>   	MDEV_CLASS_ID_VIRTIO = 2,
> +	MDEV_CLASS_ID_VHOST = 3,
>   	/* New entries must be added here */
>   };
>   
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index 40d028eed645..dad3c62bd91b 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -116,4 +116,17 @@
>   #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
>   #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
>   
> +/* VHOST_MDEV specific defines */
> +
> +/* Get and set the status of the backend. The status bits follow the
> + * same definition of the device status defined in virtio-spec. */
> +#define VHOST_MDEV_GET_STATUS		_IOW(VHOST_VIRTIO, 0x70, __u8)
> +#define VHOST_MDEV_SET_STATUS		_IOW(VHOST_VIRTIO, 0x71, __u8)
> +/* Enable/disable the ring. */
> +#define VHOST_MDEV_SET_VRING_ENABLE	_IOW(VHOST_VIRTIO, 0x72, struct vhost_vring_state)
> +/* Get the max ring size. */
> +#define VHOST_MDEV_GET_VRING_NUM	_IOW(VHOST_VIRTIO, 0x73, __u16)
> +/* Get the max number of queues. */
> +#define VHOST_MDEV_GET_QUEUE_NUM	_IOW(VHOST_VIRTIO, 0x74, __u32)


Do we need an API for userspace to get the backend capability (one that 
calls get_mdev_device_features())?

Thanks


> +
>   #endif
Tiwei Bie Oct. 23, 2019, 3:02 a.m. UTC | #2
On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> On 2019/10/22 5:52 PM, Tiwei Bie wrote:
> > This patch introduces a mdev based hardware vhost backend.
> > This backend is built on top of the same abstraction used
> > in virtio-mdev and provides a generic vhost interface for
> > userspace to accelerate the virtio devices in guest.
> > 
> > This backend is implemented as a mdev device driver on top
> > of the same mdev device ops used in virtio-mdev but using
> > a different mdev class id, and it will register the device
> > as a VFIO device for userspace to use. Userspace can setup
> > the IOMMU with the existing VFIO container/group APIs and
> > then get the device fd with the device name. After getting
> > the device fd of this device, userspace can use vhost ioctls
> > to setup the backend.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> > This patch depends on below series:
> > https://lkml.org/lkml/2019/10/17/286
> > 
> > v1 -> v2:
> > - Replace _SET_STATE with _SET_STATUS (MST);
> > - Check status bits at each step (MST);
> > - Report the max ring size and max number of queues (MST);
> > - Add missing MODULE_DEVICE_TABLE (Jason);
> > - Only support the network backend w/o multiqueue for now;
> 
> 
> Any idea on how to extend it to support devices other than net? I think we
> want a generic API or an API that could be made generic in the future.
> 
> Do we want to e.g having a generic vhost mdev for all kinds of devices or
> introducing e.g vhost-net-mdev and vhost-scsi-mdev?

One possible way is to do what vhost-user does, i.e. apart from
the generic ring, features, ... related ioctls, we also introduce
device-specific ioctls when we need them. As vhost-mdev just needs
to forward configs between the parent and userspace, and won't even
cache any info when possible, I think it might be better to do
this in one generic vhost-mdev module.
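
For context, this is the pattern the existing vhost types already follow
in include/uapi/linux/vhost.h: the ring/feature ioctls are shared, and
each device type only adds a handful of ioctls of its own, e.g.:

/* vhost-net */
#define VHOST_NET_SET_BACKEND		_IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
/* vhost-scsi */
#define VHOST_SCSI_SET_ENDPOINT		_IOW(VHOST_VIRTIO, 0x40, struct vhost_scsi_target)
/* vhost-vsock */
#define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)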

> 
> 
> > - Some minor fixes and improvements;
> > - Rebase on top of virtio-mdev series v4;
> > 
> > RFC v4 -> v1:
> > - Implement vhost-mdev as a mdev device driver directly and
> >    connect it to VFIO container/group. (Jason);
> > - Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
> >    meaningless HVA->GPA translations (Jason);
> > 
> > RFC v3 -> RFC v4:
> > - Build vhost-mdev on top of the same abstraction used by
> >    virtio-mdev (Jason);
> > - Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);
> > 
> > RFC v2 -> RFC v3:
> > - Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
> >    based vhost protocol on top of vfio-mdev (Jason);
> > 
> > RFC v1 -> RFC v2:
> > - Introduce a new VFIO device type to build a vhost protocol
> >    on top of vfio-mdev;
> > 
> >   drivers/vfio/mdev/mdev_core.c |  12 +
> >   drivers/vhost/Kconfig         |   9 +
> >   drivers/vhost/Makefile        |   3 +
> >   drivers/vhost/mdev.c          | 415 ++++++++++++++++++++++++++++++++++
> >   include/linux/mdev.h          |   3 +
> >   include/uapi/linux/vhost.h    |  13 ++
> >   6 files changed, 455 insertions(+)
> >   create mode 100644 drivers/vhost/mdev.c
> > 
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > index 5834f6b7c7a5..2963f65e6648 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -69,6 +69,18 @@ void mdev_set_virtio_ops(struct mdev_device *mdev,
> >   }
> >   EXPORT_SYMBOL(mdev_set_virtio_ops);
> > +/* Specify the vhost device ops for the mdev device, this
> > + * must be called during create() callback for vhost mdev device.
> > + */
> > +void mdev_set_vhost_ops(struct mdev_device *mdev,
> > +			const struct virtio_mdev_device_ops *vhost_ops)
> > +{
> > +	WARN_ON(mdev->class_id);
> > +	mdev->class_id = MDEV_CLASS_ID_VHOST;
> > +	mdev->device_ops = vhost_ops;
> > +}
> > +EXPORT_SYMBOL(mdev_set_vhost_ops);
> > +
> >   const void *mdev_get_dev_ops(struct mdev_device *mdev)
> >   {
> >   	return mdev->device_ops;
> > diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> > index 3d03ccbd1adc..7b5c2f655af7 100644
> > --- a/drivers/vhost/Kconfig
> > +++ b/drivers/vhost/Kconfig
> > @@ -34,6 +34,15 @@ config VHOST_VSOCK
> >   	To compile this driver as a module, choose M here: the module will be called
> >   	vhost_vsock.
> > +config VHOST_MDEV
> > +	tristate "Vhost driver for Mediated devices"
> > +	depends on EVENTFD && VFIO && VFIO_MDEV
> > +	select VHOST
> > +	default n
> > +	---help---
> > +	Say M here to enable the vhost_mdev module for use with
> > +	the mediated device based hardware vhost accelerators.
> > +
> >   config VHOST
> >   	tristate
> >   	---help---
> > diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> > index 6c6df24f770c..ad9c0f8c6d8c 100644
> > --- a/drivers/vhost/Makefile
> > +++ b/drivers/vhost/Makefile
> > @@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
> >   obj-$(CONFIG_VHOST_RING) += vringh.o
> > +obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
> > +vhost_mdev-y := mdev.o
> > +
> >   obj-$(CONFIG_VHOST)	+= vhost.o
> > diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
> > new file mode 100644
> > index 000000000000..5f9cae61018c
> > --- /dev/null
> > +++ b/drivers/vhost/mdev.c
> > @@ -0,0 +1,415 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2018-2019 Intel Corporation.
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/kernel.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/mdev.h>
> > +#include <linux/module.h>
> > +#include <linux/vfio.h>
> > +#include <linux/vhost.h>
> > +#include <linux/virtio_mdev.h>
> > +#include <linux/virtio_ids.h>
> > +
> > +#include "vhost.h"
> > +
> > +/* Currently, only network backend w/o multiqueue is supported. */
> > +#define VHOST_MDEV_VQ_MAX	2
> > +
> > +struct vhost_mdev {
> > +	/* The lock is to protect this structure. */
> > +	struct mutex mutex;
> > +	struct vhost_dev dev;
> > +	struct vhost_virtqueue *vqs;
> > +	int nvqs;
> > +	u64 status;
> > +	u64 features;
> > +	u64 acked_features;
> > +	bool opened;
> > +	struct mdev_device *mdev;
> > +};
> > +
> > +static void handle_vq_kick(struct vhost_work *work)
> > +{
> > +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> > +						  poll.work);
> > +	struct vhost_mdev *m = container_of(vq->dev, struct vhost_mdev, dev);
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> > +
> > +	ops->kick_vq(m->mdev, vq - m->vqs);
> > +}
> > +
> > +static irqreturn_t vhost_mdev_virtqueue_cb(void *private)
> > +{
> > +	struct vhost_virtqueue *vq = private;
> > +	struct eventfd_ctx *call_ctx = vq->call_ctx;
> > +
> > +	if (call_ctx)
> > +		eventfd_signal(call_ctx, 1);
> > +	return IRQ_HANDLED;
> > +}
> > +
> > +static void vhost_mdev_reset(struct vhost_mdev *m)
> > +{
> > +	struct mdev_device *mdev = m->mdev;
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> > +
> > +	m->status = 0;
> > +	return ops->set_status(mdev, m->status);
> > +}
> > +
> > +static long vhost_mdev_get_status(struct vhost_mdev *m, u8 __user *statusp)
> > +{
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> > +	struct mdev_device *mdev = m->mdev;
> > +	u8 status;
> > +
> > +	status = ops->get_status(mdev);
> > +	m->status = status;
> > +
> > +	if (copy_to_user(statusp, &status, sizeof(status)))
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> > +
> > +static long vhost_mdev_set_status(struct vhost_mdev *m, u8 __user *statusp)
> > +{
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> > +	struct mdev_device *mdev = m->mdev;
> > +	u8 status;
> > +
> > +	if (copy_from_user(&status, statusp, sizeof(status)))
> > +		return -EFAULT;
> > +
> > +	/*
> > +	 * Userspace shouldn't remove status bits unless reset the
> > +	 * status to 0.
> > +	 */
> > +	if (status != 0 && (m->status & ~status) != 0)
> > +		return -EINVAL;
> 
> 
> We don't cache vq ready information but we cache status and features here,
> any reason for this?

+1, I think it's better to not cache any unnecessary
information in vhost-mdev.

> 
> 
> > +
> > +	ops->set_status(mdev, status);
> > +	m->status = ops->get_status(mdev);
> > +
> > +	return 0;
> > +}
> > +
> > +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
> > +{
> > +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> > +		return -EFAULT;
> 
> 
> As discussed in previous version do we need to filter out MQ feature here?

I think it's more straightforward to let the parent drivers
filter out the unsupported features. Otherwise it would be tricky
when we want to add more features in the vhost-mdev module, i.e. if
the parent drivers expose unsupported features and rely on
vhost-mdev to filter them out, these features will be exposed
to userspace automatically once they are enabled in vhost-mdev
in the future.
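
(Purely as a hypothetical sketch of what that filtering looks like on the
parent side; the names below are made up and only the masking line matters:)

static u64 sample_parent_get_features(struct mdev_device *mdev)
{
	u64 features = sample_parent_read_hw_features(mdev);

	/* The parent, not vhost-mdev, hides what the stack can't handle yet. */
	features &= ~BIT_ULL(VIRTIO_NET_F_MQ);
	return features;
}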

> 
> 
> > +	return 0;
> > +}
> > +
> > +static long vhost_mdev_set_features(struct vhost_mdev *m, u64 __user *featurep)
> > +{
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> > +	struct mdev_device *mdev = m->mdev;
> > +	u64 features;
> > +
> > +	/*
> > +	 * It's not allowed to change the features after they have
> > +	 * been negotiated.
> > +	 */
> > +	if (m->status & VIRTIO_CONFIG_S_FEATURES_OK)
> > +		return -EPERM;
> 
> 
> -EBUSY?

Yeah, definitely.

> 
> 
> > +
> > +	if (copy_from_user(&features, featurep, sizeof(features)))
> > +		return -EFAULT;
> > +
> > +	if (features & ~m->features)
> > +		return -EINVAL;
> > +
> > +	m->acked_features = features;
> > +	if (ops->set_features(mdev, m->acked_features))
> > +		return -ENODEV;
> 
> 
> -EINVAL should be better, this would be more obvious for parent that wants
> to force any feature.

+1. Agree.

> 
> 
> > +
> > +	return 0;
> > +}
> > +
> > +static long vhost_mdev_get_vring_num(struct vhost_mdev *m, u16 __user *argp)
> > +{
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> > +	struct mdev_device *mdev = m->mdev;
> > +	u16 num;
> > +
> > +	num = ops->get_vq_num_max(mdev);
> > +
> > +	if (copy_to_user(argp, &num, sizeof(num)))
> > +		return -EFAULT;
> > +	return 0;
> > +}
> > +
> > +static long vhost_mdev_get_queue_num(struct vhost_mdev *m, u32 __user *argp)
> > +{
> > +	u32 nvqs = m->nvqs;
> > +
> > +	if (copy_to_user(argp, &nvqs, sizeof(nvqs)))
> > +		return -EFAULT;
> > +	return 0;
> > +}
> > +
> > +static long vhost_mdev_vring_ioctl(struct vhost_mdev *m, unsigned int cmd,
> > +				   void __user *argp)
> > +{
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
> > +	struct mdev_device *mdev = m->mdev;
> > +	struct virtio_mdev_callback cb;
> > +	struct vhost_virtqueue *vq;
> > +	struct vhost_vring_state s;
> > +	u32 idx;
> > +	long r;
> > +
> > +	r = get_user(idx, (u32 __user *)argp);
> > +	if (r < 0)
> > +		return r;
> > +	if (idx >= m->nvqs)
> > +		return -ENOBUFS;
> > +
> > +	/*
> > +	 * It's not allowed to detect and program vqs before
> > +	 * features negotiation or after enabling driver.
> > +	 */
> > +	if (!(m->status & VIRTIO_CONFIG_S_FEATURES_OK) ||
> > +	    (m->status & VIRTIO_CONFIG_S_DRIVER_OK))
> > +		return -EPERM;
> 
> 
> So the question is: is it better to do this in parent or not?

I think it will duplicate the generic code in each parent.

> 
> 
> > +
> > +	vq = &m->vqs[idx];
> > +
> > +	if (cmd == VHOST_MDEV_SET_VRING_ENABLE) {
> > +		if (copy_from_user(&s, argp, sizeof(s)))
> > +			return -EFAULT;
> > +		ops->set_vq_ready(mdev, idx, s.num);
> > +		return 0;
> > +	}
> > +
> > +	/*
> > +	 * It's not allowed to detect and program vqs with
> > +	 * vqs enabled.
> > +	 */
> > +	if (ops->get_vq_ready(mdev, idx))
> > +		return -EPERM;
> > +
> > +	if (cmd == VHOST_GET_VRING_BASE)
> > +		vq->last_avail_idx = ops->get_vq_state(m->mdev, idx);
> > +
> > +	r = vhost_vring_ioctl(&m->dev, cmd, argp);
> > +	if (r)
> > +		return r;
> > +
> > +	switch (cmd) {
> > +	case VHOST_SET_VRING_ADDR:
> > +		/*
> > +		 * In vhost-mdev, the ring addresses set by userspace should
> > +		 * be the DMA addresses within the VFIO container/group.
> > +		 */
> > +		if (ops->set_vq_address(mdev, idx, (u64)vq->desc,
> > +					(u64)vq->avail, (u64)vq->used))
> > +			r = -ENODEV;
> > +		break;
> > +
> > +	case VHOST_SET_VRING_BASE:
> > +		if (ops->set_vq_state(mdev, idx, vq->last_avail_idx))
> > +			r = -ENODEV;
> > +		break;
> > +
> > +	case VHOST_SET_VRING_CALL:
> > +		if (vq->call_ctx) {
> > +			cb.callback = vhost_mdev_virtqueue_cb;
> > +			cb.private = vq;
> > +		} else {
> > +			cb.callback = NULL;
> > +			cb.private = NULL;
> > +		}
> > +		ops->set_vq_cb(mdev, idx, &cb);
> > +		break;
> > +
> > +	case VHOST_SET_VRING_NUM:
> > +		ops->set_vq_num(mdev, idx, vq->num);
> > +		break;
> > +	}
> > +
> > +	return r;
> > +}
> > +
> > +static int vhost_mdev_open(void *device_data)
> > +{
> > +	struct vhost_mdev *m = device_data;
> > +	struct vhost_dev *dev;
> > +	struct vhost_virtqueue **vqs;
> > +	int nvqs, i, r;
> > +
> > +	if (!try_module_get(THIS_MODULE))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&m->mutex);
> > +
> > +	if (m->opened) {
> > +		r = -EBUSY;
> > +		goto err;
> > +	}
> > +
> > +	nvqs = m->nvqs;
> > +	vhost_mdev_reset(m);
> > +
> > +	memset(&m->dev, 0, sizeof(m->dev));
> > +	memset(m->vqs, 0, nvqs * sizeof(struct vhost_virtqueue));
> > +
> > +	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
> > +	if (!vqs) {
> > +		r = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	dev = &m->dev;
> > +	for (i = 0; i < nvqs; i++) {
> > +		vqs[i] = &m->vqs[i];
> > +		vqs[i]->handle_kick = handle_vq_kick;
> > +	}
> > +	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
> > +	m->opened = true;
> > +	mutex_unlock(&m->mutex);
> > +
> > +	return 0;
> > +
> > +err:
> > +	mutex_unlock(&m->mutex);
> > +	module_put(THIS_MODULE);
> > +	return r;
> > +}
> > +
> > +static void vhost_mdev_release(void *device_data)
> > +{
> > +	struct vhost_mdev *m = device_data;
> > +
> > +	mutex_lock(&m->mutex);
> > +	vhost_mdev_reset(m);
> > +	vhost_dev_stop(&m->dev);
> > +	vhost_dev_cleanup(&m->dev);
> > +
> > +	kfree(m->dev.vqs);
> > +	m->opened = false;
> > +	mutex_unlock(&m->mutex);
> > +	module_put(THIS_MODULE);
> > +}
> > +
> > +static long vhost_mdev_unlocked_ioctl(void *device_data,
> > +				      unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vhost_mdev *m = device_data;
> > +	void __user *argp = (void __user *)arg;
> > +	long r;
> > +
> > +	mutex_lock(&m->mutex);
> > +
> > +	switch (cmd) {
> > +	case VHOST_MDEV_GET_STATUS:
> > +		r = vhost_mdev_get_status(m, argp);
> > +		break;
> > +	case VHOST_MDEV_SET_STATUS:
> > +		r = vhost_mdev_set_status(m, argp);
> > +		break;
> > +	case VHOST_GET_FEATURES:
> > +		r = vhost_mdev_get_features(m, argp);
> > +		break;
> > +	case VHOST_SET_FEATURES:
> > +		r = vhost_mdev_set_features(m, argp);
> > +		break;
> > +	case VHOST_MDEV_GET_VRING_NUM:
> > +		r = vhost_mdev_get_vring_num(m, argp);
> > +		break;
> > +	case VHOST_MDEV_GET_QUEUE_NUM:
> > +		r = vhost_mdev_get_queue_num(m, argp);
> > +		break;
> 
> 
> It's not clear to me that how this API will be used by userspace? I think
> e.g features without MQ implies the queue num here.

I was thinking about always letting _GET_QUEUE_NUM return
the supported number of queues. For virtio devices other
than virtio-net, can we always expect to have a fixed
default number of queues when there is no MQ feature?

> 
> 
> > +	default:
> > +		r = vhost_dev_ioctl(&m->dev, cmd, argp);
> 
> 
> I believe having SET_MEM_TABLE/SET_LOG_BASE/SET_LOG_FD  is for future
> support of those features. If it's true need add some comments on this.

OK.

> 
> 
> > +		if (r == -ENOIOCTLCMD)
> > +			r = vhost_mdev_vring_ioctl(m, cmd, argp);
> > +	}
> > +
> > +	mutex_unlock(&m->mutex);
> > +	return r;
> > +}
> > +
> > +static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
> > +	.name		= "vfio-vhost-mdev",
> > +	.open		= vhost_mdev_open,
> > +	.release	= vhost_mdev_release,
> > +	.ioctl		= vhost_mdev_unlocked_ioctl,
> > +};
> > +
> > +static int vhost_mdev_probe(struct device *dev)
> > +{
> > +	struct mdev_device *mdev = mdev_from_dev(dev);
> > +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
> > +	struct vhost_mdev *m;
> > +	int nvqs, r;
> > +
> > +	/* Currently, only network backend is supported. */
> > +	if (ops->get_device_id(mdev) != VIRTIO_ID_NET)
> > +		return -ENOTSUPP;
> 
> 
> If we decide to go with the way of vhost-net-mdev, probably need something
> smarter. E.g a vhost bus etc.
> 
> 
> > +
> > +	if (ops->get_mdev_features(mdev) != VIRTIO_MDEV_F_VERSION_1)
> > +		return -ENOTSUPP;
> > +
> > +	m = devm_kzalloc(dev, sizeof(*m), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> > +	if (!m)
> > +		return -ENOMEM;
> > +
> > +	nvqs = VHOST_MDEV_VQ_MAX;
> > +	m->nvqs = nvqs;
> > +
> > +	m->vqs = devm_kmalloc_array(dev, nvqs, sizeof(struct vhost_virtqueue),
> > +				    GFP_KERNEL);
> > +	if (!m->vqs)
> > +		return -ENOMEM;
> 
> 
> Is it better to move those allocation to open? Otherwise the memset there
> seems strange.

OK.

> 
> 
> > +
> > +	r = vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, m);
> > +	if (r)
> > +		return r;
> > +
> > +	mutex_init(&m->mutex);
> > +	m->features = ops->get_features(mdev);
> > +	m->mdev = mdev;
> > +	return 0;
> > +}
> > +
> > +static void vhost_mdev_remove(struct device *dev)
> > +{
> > +	struct vhost_mdev *m;
> > +
> > +	m = vfio_del_group_dev(dev);
> > +	mutex_destroy(&m->mutex);
> > +}
> > +
> > +static const struct mdev_class_id vhost_mdev_match[] = {
> > +	{ MDEV_CLASS_ID_VHOST },
> > +	{ 0 },
> > +};
> > +MODULE_DEVICE_TABLE(mdev, vhost_mdev_match);
> > +
> > +static struct mdev_driver vhost_mdev_driver = {
> > +	.name	= "vhost_mdev",
> > +	.probe	= vhost_mdev_probe,
> > +	.remove	= vhost_mdev_remove,
> > +	.id_table = vhost_mdev_match,
> > +};
> > +
> > +static int __init vhost_mdev_init(void)
> > +{
> > +	return mdev_register_driver(&vhost_mdev_driver, THIS_MODULE);
> > +}
> > +module_init(vhost_mdev_init);
> > +
> > +static void __exit vhost_mdev_exit(void)
> > +{
> > +	mdev_unregister_driver(&vhost_mdev_driver);
> > +}
> > +module_exit(vhost_mdev_exit);
> > +
> > +MODULE_VERSION("0.0.1");
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_DESCRIPTION("Mediated device based accelerator for virtio");
> > diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> > index 13e045e09d3b..6060cdbe6d3e 100644
> > --- a/include/linux/mdev.h
> > +++ b/include/linux/mdev.h
> > @@ -114,6 +114,8 @@ void mdev_set_vfio_ops(struct mdev_device *mdev,
> >   		       const struct vfio_mdev_device_ops *vfio_ops);
> >   void mdev_set_virtio_ops(struct mdev_device *mdev,
> >                            const struct virtio_mdev_device_ops *virtio_ops);
> > +void mdev_set_vhost_ops(struct mdev_device *mdev,
> > +			const struct virtio_mdev_device_ops *vhost_ops);
> >   const void *mdev_get_dev_ops(struct mdev_device *mdev);
> >   extern struct bus_type mdev_bus_type;
> > @@ -131,6 +133,7 @@ struct mdev_device *mdev_from_dev(struct device *dev);
> >   enum {
> >   	MDEV_CLASS_ID_VFIO = 1,
> >   	MDEV_CLASS_ID_VIRTIO = 2,
> > +	MDEV_CLASS_ID_VHOST = 3,
> >   	/* New entries must be added here */
> >   };
> > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > index 40d028eed645..dad3c62bd91b 100644
> > --- a/include/uapi/linux/vhost.h
> > +++ b/include/uapi/linux/vhost.h
> > @@ -116,4 +116,17 @@
> >   #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
> >   #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
> > +/* VHOST_MDEV specific defines */
> > +
> > +/* Get and set the status of the backend. The status bits follow the
> > + * same definition of the device status defined in virtio-spec. */
> > +#define VHOST_MDEV_GET_STATUS		_IOW(VHOST_VIRTIO, 0x70, __u8)
> > +#define VHOST_MDEV_SET_STATUS		_IOW(VHOST_VIRTIO, 0x71, __u8)
> > +/* Enable/disable the ring. */
> > +#define VHOST_MDEV_SET_VRING_ENABLE	_IOW(VHOST_VIRTIO, 0x72, struct vhost_vring_state)
> > +/* Get the max ring size. */
> > +#define VHOST_MDEV_GET_VRING_NUM	_IOW(VHOST_VIRTIO, 0x73, __u16)
> > +/* Get the max number of queues. */
> > +#define VHOST_MDEV_GET_QUEUE_NUM	_IOW(VHOST_VIRTIO, 0x74, __u32)
> 
> 
> Do we need API for userspace to get backend capability? (that calls
> get_mdev_device_features())

Vhost already has the features and backend features ioctls.
In vhost-mdev, it might be better to still just use them to
expose the backend capability to userspace.
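
A minimal sketch of that reuse from userspace, assuming the existing
VHOST_GET_FEATURES/VHOST_GET_BACKEND_FEATURES ioctls are what is meant
here:

__u64 features, backend_features;

ioctl(dev, VHOST_GET_FEATURES, &features);			/* virtio device features */
ioctl(dev, VHOST_GET_BACKEND_FEATURES, &backend_features);	/* vhost backend capability */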

Thanks for the review!
Tiwei

> 
> Thanks
> 
> 
> > +
> >   #endif
>
Jason Wang Oct. 23, 2019, 5:46 a.m. UTC | #3
On 2019/10/23 11:02 AM, Tiwei Bie wrote:
> On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
>> On 2019/10/22 5:52 PM, Tiwei Bie wrote:
>>> This patch introduces a mdev based hardware vhost backend.
>>> This backend is built on top of the same abstraction used
>>> in virtio-mdev and provides a generic vhost interface for
>>> userspace to accelerate the virtio devices in guest.
>>>
>>> This backend is implemented as a mdev device driver on top
>>> of the same mdev device ops used in virtio-mdev but using
>>> a different mdev class id, and it will register the device
>>> as a VFIO device for userspace to use. Userspace can setup
>>> the IOMMU with the existing VFIO container/group APIs and
>>> then get the device fd with the device name. After getting
>>> the device fd of this device, userspace can use vhost ioctls
>>> to setup the backend.
>>>
>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>> ---
>>> This patch depends on below series:
>>> https://lkml.org/lkml/2019/10/17/286
>>>
>>> v1 -> v2:
>>> - Replace _SET_STATE with _SET_STATUS (MST);
>>> - Check status bits at each step (MST);
>>> - Report the max ring size and max number of queues (MST);
>>> - Add missing MODULE_DEVICE_TABLE (Jason);
>>> - Only support the network backend w/o multiqueue for now;
>>
>> Any idea on how to extend it to support devices other than net? I think we
>> want a generic API or an API that could be made generic in the future.
>>
>> Do we want to e.g having a generic vhost mdev for all kinds of devices or
>> introducing e.g vhost-net-mdev and vhost-scsi-mdev?
> One possible way is to do what vhost-user does. I.e. Apart from
> the generic ring, features, ... related ioctls, we also introduce
> device specific ioctls when we need them. As vhost-mdev just needs
> to forward configs between parent and userspace and even won't
> cache any info when possible,


So it looks to me that this is only possible if we expose e.g. set_config 
and get_config to userspace.
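
(Purely as an illustration of what such an extension could look like;
these ioctls do not exist in this patch and the numbers are made up to
follow the 0x70..0x74 range it introduces:)

struct vhost_mdev_config {
	__u32 off;
	__u32 len;
	__u8  buf[];
};

#define VHOST_MDEV_GET_CONFIG	_IOR(VHOST_VIRTIO, 0x75, struct vhost_mdev_config)
#define VHOST_MDEV_SET_CONFIG	_IOW(VHOST_VIRTIO, 0x76, struct vhost_mdev_config)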


> I think it might be better to do
> this in one generic vhost-mdev module.


Looking at the definitions of VhostUserRequest in QEMU, it mixes the 
generic API with device-specific APIs. If we want to go this way (a generic 
vhost-mdev), more questions need to be answered:

1) How could userspace know which type of vhost it would use? Do we need 
to expose the virtio subsystem device to userspace in this case?

2) The generic vhost-mdev module still needs to filter out unsupported 
ioctls for a specific type. E.g. if it probes a net device, it should 
refuse the APIs for other types. This is in fact a vhost-mdev-net, just not 
modularized on top of vhost-mdev.


>
>>
>>> - Some minor fixes and improvements;
>>> - Rebase on top of virtio-mdev series v4;
>>>
>>> RFC v4 -> v1:
>>> - Implement vhost-mdev as a mdev device driver directly and
>>>     connect it to VFIO container/group. (Jason);
>>> - Pass ring addresses as GPAs/IOVAs in vhost-mdev to avoid
>>>     meaningless HVA->GPA translations (Jason);
>>>
>>> RFC v3 -> RFC v4:
>>> - Build vhost-mdev on top of the same abstraction used by
>>>     virtio-mdev (Jason);
>>> - Introduce vhost fd and pass VFIO fd via SET_BACKEND ioctl (MST);
>>>
>>> RFC v2 -> RFC v3:
>>> - Reuse vhost's ioctls instead of inventing a VFIO regions/irqs
>>>     based vhost protocol on top of vfio-mdev (Jason);
>>>
>>> RFC v1 -> RFC v2:
>>> - Introduce a new VFIO device type to build a vhost protocol
>>>     on top of vfio-mdev;
>>>
>>>    drivers/vfio/mdev/mdev_core.c |  12 +
>>>    drivers/vhost/Kconfig         |   9 +
>>>    drivers/vhost/Makefile        |   3 +
>>>    drivers/vhost/mdev.c          | 415 ++++++++++++++++++++++++++++++++++
>>>    include/linux/mdev.h          |   3 +
>>>    include/uapi/linux/vhost.h    |  13 ++
>>>    6 files changed, 455 insertions(+)
>>>    create mode 100644 drivers/vhost/mdev.c
>>>
>>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
>>> index 5834f6b7c7a5..2963f65e6648 100644
>>> --- a/drivers/vfio/mdev/mdev_core.c
>>> +++ b/drivers/vfio/mdev/mdev_core.c
>>> @@ -69,6 +69,18 @@ void mdev_set_virtio_ops(struct mdev_device *mdev,
>>>    }
>>>    EXPORT_SYMBOL(mdev_set_virtio_ops);
>>> +/* Specify the vhost device ops for the mdev device, this
>>> + * must be called during create() callback for vhost mdev device.
>>> + */
>>> +void mdev_set_vhost_ops(struct mdev_device *mdev,
>>> +			const struct virtio_mdev_device_ops *vhost_ops)
>>> +{
>>> +	WARN_ON(mdev->class_id);
>>> +	mdev->class_id = MDEV_CLASS_ID_VHOST;
>>> +	mdev->device_ops = vhost_ops;
>>> +}
>>> +EXPORT_SYMBOL(mdev_set_vhost_ops);
>>> +
>>>    const void *mdev_get_dev_ops(struct mdev_device *mdev)
>>>    {
>>>    	return mdev->device_ops;
>>> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
>>> index 3d03ccbd1adc..7b5c2f655af7 100644
>>> --- a/drivers/vhost/Kconfig
>>> +++ b/drivers/vhost/Kconfig
>>> @@ -34,6 +34,15 @@ config VHOST_VSOCK
>>>    	To compile this driver as a module, choose M here: the module will be called
>>>    	vhost_vsock.
>>> +config VHOST_MDEV
>>> +	tristate "Vhost driver for Mediated devices"
>>> +	depends on EVENTFD && VFIO && VFIO_MDEV
>>> +	select VHOST
>>> +	default n
>>> +	---help---
>>> +	Say M here to enable the vhost_mdev module for use with
>>> +	the mediated device based hardware vhost accelerators.
>>> +
>>>    config VHOST
>>>    	tristate
>>>    	---help---
>>> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
>>> index 6c6df24f770c..ad9c0f8c6d8c 100644
>>> --- a/drivers/vhost/Makefile
>>> +++ b/drivers/vhost/Makefile
>>> @@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
>>>    obj-$(CONFIG_VHOST_RING) += vringh.o
>>> +obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
>>> +vhost_mdev-y := mdev.o
>>> +
>>>    obj-$(CONFIG_VHOST)	+= vhost.o
>>> diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
>>> new file mode 100644
>>> index 000000000000..5f9cae61018c
>>> --- /dev/null
>>> +++ b/drivers/vhost/mdev.c
>>> @@ -0,0 +1,415 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Copyright (C) 2018-2019 Intel Corporation.
>>> + */
>>> +
>>> +#include <linux/compat.h>
>>> +#include <linux/kernel.h>
>>> +#include <linux/miscdevice.h>
>>> +#include <linux/mdev.h>
>>> +#include <linux/module.h>
>>> +#include <linux/vfio.h>
>>> +#include <linux/vhost.h>
>>> +#include <linux/virtio_mdev.h>
>>> +#include <linux/virtio_ids.h>
>>> +
>>> +#include "vhost.h"
>>> +
>>> +/* Currently, only network backend w/o multiqueue is supported. */
>>> +#define VHOST_MDEV_VQ_MAX	2
>>> +
>>> +struct vhost_mdev {
>>> +	/* The lock is to protect this structure. */
>>> +	struct mutex mutex;
>>> +	struct vhost_dev dev;
>>> +	struct vhost_virtqueue *vqs;
>>> +	int nvqs;
>>> +	u64 status;
>>> +	u64 features;
>>> +	u64 acked_features;
>>> +	bool opened;
>>> +	struct mdev_device *mdev;
>>> +};
>>> +
>>> +static void handle_vq_kick(struct vhost_work *work)
>>> +{
>>> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
>>> +						  poll.work);
>>> +	struct vhost_mdev *m = container_of(vq->dev, struct vhost_mdev, dev);
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
>>> +
>>> +	ops->kick_vq(m->mdev, vq - m->vqs);
>>> +}
>>> +
>>> +static irqreturn_t vhost_mdev_virtqueue_cb(void *private)
>>> +{
>>> +	struct vhost_virtqueue *vq = private;
>>> +	struct eventfd_ctx *call_ctx = vq->call_ctx;
>>> +
>>> +	if (call_ctx)
>>> +		eventfd_signal(call_ctx, 1);
>>> +	return IRQ_HANDLED;
>>> +}
>>> +
>>> +static void vhost_mdev_reset(struct vhost_mdev *m)
>>> +{
>>> +	struct mdev_device *mdev = m->mdev;
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
>>> +
>>> +	m->status = 0;
>>> +	return ops->set_status(mdev, m->status);
>>> +}
>>> +
>>> +static long vhost_mdev_get_status(struct vhost_mdev *m, u8 __user *statusp)
>>> +{
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
>>> +	struct mdev_device *mdev = m->mdev;
>>> +	u8 status;
>>> +
>>> +	status = ops->get_status(mdev);
>>> +	m->status = status;
>>> +
>>> +	if (copy_to_user(statusp, &status, sizeof(status)))
>>> +		return -EFAULT;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static long vhost_mdev_set_status(struct vhost_mdev *m, u8 __user *statusp)
>>> +{
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
>>> +	struct mdev_device *mdev = m->mdev;
>>> +	u8 status;
>>> +
>>> +	if (copy_from_user(&status, statusp, sizeof(status)))
>>> +		return -EFAULT;
>>> +
>>> +	/*
>>> +	 * Userspace shouldn't remove status bits unless reset the
>>> +	 * status to 0.
>>> +	 */
>>> +	if (status != 0 && (m->status & ~status) != 0)
>>> +		return -EINVAL;
>>
>> We don't cache vq ready information but we cache status and features here,
>> any reason for this?
> +1, I think it's better to not cache any unnecessary
> information in vhost-mdev.
>
>>
>>> +
>>> +	ops->set_status(mdev, status);
>>> +	m->status = ops->get_status(mdev);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
>>> +{
>>> +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
>>> +		return -EFAULT;
>>
>> As discussed in previous version do we need to filter out MQ feature here?
> I think it's more straightforward to let the parent drivers to
> filter out the unsupported features. Otherwise it would be tricky
> when we want to add more features in vhost-mdev module,


It's as simple as removing the feature from the blacklist?


> i.e. if
> the parent drivers may expose unsupported features and relay on
> vhost-mdev to filter them out, these features will be exposed
> to userspace automatically when they are enabled in vhost-mdev
> in the future.


The issue is that only vhost-mdev knows its own limitations. E.g. in 
this patch, vhost-mdev only implements a subset of the transport API, but 
the parent doesn't know about that.

Still taking MQ as an example, there's no way (or no need) for the parent 
to know that vhost-mdev does not support MQ. And this allows an old kernel 
to work with new parent drivers.

So basically we have three choices here:

1) Implement what vhost-user did and have a generic vhost-mdev (which may 
still contain lots of device-specific code). To support advanced features 
which require access to the config space, lots of APIs still need to be 
added.

2) Implement what vhost-kernel did: have a generic vhost-mdev driver and 
a vhost bus on top to match a device-specific API, e.g. vhost-mdev-net. 
We still have device-specific APIs but limit them to the device-specific 
module. This still requires new ioctls for advanced features like MQ.

3) Simply expose the whole virtio-mdev transport to userspace: a generic 
module without any type-specific code (like virtio-mdev). No dedicated 
API is needed for e.g. MQ. But then the API will look much different from 
what vhost currently does.

Considering the limitation of 1), I tend to choose 2 or 3. What's your opinion?


>
>>
>>> +	return 0;
>>> +}
>>> +
>>> +static long vhost_mdev_set_features(struct vhost_mdev *m, u64 __user *featurep)
>>> +{
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
>>> +	struct mdev_device *mdev = m->mdev;
>>> +	u64 features;
>>> +
>>> +	/*
>>> +	 * It's not allowed to change the features after they have
>>> +	 * been negotiated.
>>> +	 */
>>> +	if (m->status & VIRTIO_CONFIG_S_FEATURES_OK)
>>> +		return -EPERM;
>>
>> -EBUSY?
> Yeah, definitely.
>
>>
>>> +
>>> +	if (copy_from_user(&features, featurep, sizeof(features)))
>>> +		return -EFAULT;
>>> +
>>> +	if (features & ~m->features)
>>> +		return -EINVAL;
>>> +
>>> +	m->acked_features = features;
>>> +	if (ops->set_features(mdev, m->acked_features))
>>> +		return -ENODEV;
>>
>> -EINVAL should be better, this would be more obvious for parent that wants
>> to force any feature.
> +1. Agree.
>
>>
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static long vhost_mdev_get_vring_num(struct vhost_mdev *m, u16 __user *argp)
>>> +{
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
>>> +	struct mdev_device *mdev = m->mdev;
>>> +	u16 num;
>>> +
>>> +	num = ops->get_vq_num_max(mdev);
>>> +
>>> +	if (copy_to_user(argp, &num, sizeof(num)))
>>> +		return -EFAULT;
>>> +	return 0;
>>> +}
>>> +
>>> +static long vhost_mdev_get_queue_num(struct vhost_mdev *m, u32 __user *argp)
>>> +{
>>> +	u32 nvqs = m->nvqs;
>>> +
>>> +	if (copy_to_user(argp, &nvqs, sizeof(nvqs)))
>>> +		return -EFAULT;
>>> +	return 0;
>>> +}
>>> +
>>> +static long vhost_mdev_vring_ioctl(struct vhost_mdev *m, unsigned int cmd,
>>> +				   void __user *argp)
>>> +{
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
>>> +	struct mdev_device *mdev = m->mdev;
>>> +	struct virtio_mdev_callback cb;
>>> +	struct vhost_virtqueue *vq;
>>> +	struct vhost_vring_state s;
>>> +	u32 idx;
>>> +	long r;
>>> +
>>> +	r = get_user(idx, (u32 __user *)argp);
>>> +	if (r < 0)
>>> +		return r;
>>> +	if (idx >= m->nvqs)
>>> +		return -ENOBUFS;
>>> +
>>> +	/*
>>> +	 * It's not allowed to detect and program vqs before
>>> +	 * features negotiation or after enabling driver.
>>> +	 */
>>> +	if (!(m->status & VIRTIO_CONFIG_S_FEATURES_OK) ||
>>> +	    (m->status & VIRTIO_CONFIG_S_DRIVER_OK))
>>> +		return -EPERM;
>>
>> So the question is: is it better to do this in parent or not?
> I think it will duplicate the generic code in each parent.
>
>>
>>> +
>>> +	vq = &m->vqs[idx];
>>> +
>>> +	if (cmd == VHOST_MDEV_SET_VRING_ENABLE) {
>>> +		if (copy_from_user(&s, argp, sizeof(s)))
>>> +			return -EFAULT;
>>> +		ops->set_vq_ready(mdev, idx, s.num);
>>> +		return 0;
>>> +	}
>>> +
>>> +	/*
>>> +	 * It's not allowed to detect and program vqs with
>>> +	 * vqs enabled.
>>> +	 */
>>> +	if (ops->get_vq_ready(mdev, idx))
>>> +		return -EPERM;
>>> +
>>> +	if (cmd == VHOST_GET_VRING_BASE)
>>> +		vq->last_avail_idx = ops->get_vq_state(m->mdev, idx);
>>> +
>>> +	r = vhost_vring_ioctl(&m->dev, cmd, argp);
>>> +	if (r)
>>> +		return r;
>>> +
>>> +	switch (cmd) {
>>> +	case VHOST_SET_VRING_ADDR:
>>> +		/*
>>> +		 * In vhost-mdev, the ring addresses set by userspace should
>>> +		 * be the DMA addresses within the VFIO container/group.
>>> +		 */
>>> +		if (ops->set_vq_address(mdev, idx, (u64)vq->desc,
>>> +					(u64)vq->avail, (u64)vq->used))
>>> +			r = -ENODEV;
>>> +		break;
>>> +
>>> +	case VHOST_SET_VRING_BASE:
>>> +		if (ops->set_vq_state(mdev, idx, vq->last_avail_idx))
>>> +			r = -ENODEV;
>>> +		break;
>>> +
>>> +	case VHOST_SET_VRING_CALL:
>>> +		if (vq->call_ctx) {
>>> +			cb.callback = vhost_mdev_virtqueue_cb;
>>> +			cb.private = vq;
>>> +		} else {
>>> +			cb.callback = NULL;
>>> +			cb.private = NULL;
>>> +		}
>>> +		ops->set_vq_cb(mdev, idx, &cb);
>>> +		break;
>>> +
>>> +	case VHOST_SET_VRING_NUM:
>>> +		ops->set_vq_num(mdev, idx, vq->num);
>>> +		break;
>>> +	}
>>> +
>>> +	return r;
>>> +}
>>> +
>>> +static int vhost_mdev_open(void *device_data)
>>> +{
>>> +	struct vhost_mdev *m = device_data;
>>> +	struct vhost_dev *dev;
>>> +	struct vhost_virtqueue **vqs;
>>> +	int nvqs, i, r;
>>> +
>>> +	if (!try_module_get(THIS_MODULE))
>>> +		return -ENODEV;
>>> +
>>> +	mutex_lock(&m->mutex);
>>> +
>>> +	if (m->opened) {
>>> +		r = -EBUSY;
>>> +		goto err;
>>> +	}
>>> +
>>> +	nvqs = m->nvqs;
>>> +	vhost_mdev_reset(m);
>>> +
>>> +	memset(&m->dev, 0, sizeof(m->dev));
>>> +	memset(m->vqs, 0, nvqs * sizeof(struct vhost_virtqueue));
>>> +
>>> +	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
>>> +	if (!vqs) {
>>> +		r = -ENOMEM;
>>> +		goto err;
>>> +	}
>>> +
>>> +	dev = &m->dev;
>>> +	for (i = 0; i < nvqs; i++) {
>>> +		vqs[i] = &m->vqs[i];
>>> +		vqs[i]->handle_kick = handle_vq_kick;
>>> +	}
>>> +	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
>>> +	m->opened = true;
>>> +	mutex_unlock(&m->mutex);
>>> +
>>> +	return 0;
>>> +
>>> +err:
>>> +	mutex_unlock(&m->mutex);
>>> +	module_put(THIS_MODULE);
>>> +	return r;
>>> +}
>>> +
>>> +static void vhost_mdev_release(void *device_data)
>>> +{
>>> +	struct vhost_mdev *m = device_data;
>>> +
>>> +	mutex_lock(&m->mutex);
>>> +	vhost_mdev_reset(m);
>>> +	vhost_dev_stop(&m->dev);
>>> +	vhost_dev_cleanup(&m->dev);
>>> +
>>> +	kfree(m->dev.vqs);
>>> +	m->opened = false;
>>> +	mutex_unlock(&m->mutex);
>>> +	module_put(THIS_MODULE);
>>> +}
>>> +
>>> +static long vhost_mdev_unlocked_ioctl(void *device_data,
>>> +				      unsigned int cmd, unsigned long arg)
>>> +{
>>> +	struct vhost_mdev *m = device_data;
>>> +	void __user *argp = (void __user *)arg;
>>> +	long r;
>>> +
>>> +	mutex_lock(&m->mutex);
>>> +
>>> +	switch (cmd) {
>>> +	case VHOST_MDEV_GET_STATUS:
>>> +		r = vhost_mdev_get_status(m, argp);
>>> +		break;
>>> +	case VHOST_MDEV_SET_STATUS:
>>> +		r = vhost_mdev_set_status(m, argp);
>>> +		break;
>>> +	case VHOST_GET_FEATURES:
>>> +		r = vhost_mdev_get_features(m, argp);
>>> +		break;
>>> +	case VHOST_SET_FEATURES:
>>> +		r = vhost_mdev_set_features(m, argp);
>>> +		break;
>>> +	case VHOST_MDEV_GET_VRING_NUM:
>>> +		r = vhost_mdev_get_vring_num(m, argp);
>>> +		break;
>>> +	case VHOST_MDEV_GET_QUEUE_NUM:
>>> +		r = vhost_mdev_get_queue_num(m, argp);
>>> +		break;
>>
>> It's not clear to me that how this API will be used by userspace? I think
>> e.g features without MQ implies the queue num here.
> I was thinking about always letting _GET_QUEUE_NUM return
> the supported number of queues. For virtio devices other
> than virtio-net, can we always expect to have a fixed
> default number of queues when there is no MQ feature?


It could be very tricky since each type has its own MQ feature; you 
probably want a map between the device id and its default queue num. But if 
we don't support MQ, userspace should follow the #queues defined by the 
spec. So I still don't see the value of this API.
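
For illustration only, the kind of map being talked about would look 
roughly like the sketch below. The numbers are made up, and keeping them 
correct across device types and negotiated features is exactly the tricky 
part:

/*
 * Hypothetical sketch only, not part of this patch: a "device id ->
 * default #vqs" table.  The values are illustrative; the real defaults
 * depend on the negotiated features.
 */
#include <linux/kernel.h>
#include <linux/virtio_ids.h>

struct vhost_mdev_default_vqs {
	u32 device_id;
	u16 num_vqs;
};

static const struct vhost_mdev_default_vqs default_vqs[] = {
	{ VIRTIO_ID_NET,   2 },	/* 1 rx + 1 tx, no MQ and no ctrl vq */
	{ VIRTIO_ID_BLOCK, 1 },	/* single request queue */
	{ VIRTIO_ID_VSOCK, 3 },	/* rx, tx, event */
};

static u16 default_num_vqs(u32 device_id)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(default_vqs); i++)
		if (default_vqs[i].device_id == device_id)
			return default_vqs[i].num_vqs;
	return 1;
}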

In the future, if we want to support multiqueue, it will still be much 
trickier than exporting the device config space to userspace.


>
>>
>>> +	default:
>>> +		r = vhost_dev_ioctl(&m->dev, cmd, argp);
>>
>> I believe having SET_MEM_TABLE/SET_LOG_BASE/SET_LOG_FD  is for future
>> support of those features. If it's true need add some comments on this.
> OK.
>
>>
>>> +		if (r == -ENOIOCTLCMD)
>>> +			r = vhost_mdev_vring_ioctl(m, cmd, argp);
>>> +	}
>>> +
>>> +	mutex_unlock(&m->mutex);
>>> +	return r;
>>> +}
>>> +
>>> +static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
>>> +	.name		= "vfio-vhost-mdev",
>>> +	.open		= vhost_mdev_open,
>>> +	.release	= vhost_mdev_release,
>>> +	.ioctl		= vhost_mdev_unlocked_ioctl,
>>> +};
>>> +
>>> +static int vhost_mdev_probe(struct device *dev)
>>> +{
>>> +	struct mdev_device *mdev = mdev_from_dev(dev);
>>> +	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
>>> +	struct vhost_mdev *m;
>>> +	int nvqs, r;
>>> +
>>> +	/* Currently, only network backend is supported. */
>>> +	if (ops->get_device_id(mdev) != VIRTIO_ID_NET)
>>> +		return -ENOTSUPP;
>>
>> If we decide to go with the way of vhost-net-mdev, probably need something
>> smarter. E.g a vhost bus etc.
>>
>>
>>> +
>>> +	if (ops->get_mdev_features(mdev) != VIRTIO_MDEV_F_VERSION_1)
>>> +		return -ENOTSUPP;
>>> +
>>> +	m = devm_kzalloc(dev, sizeof(*m), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>>> +	if (!m)
>>> +		return -ENOMEM;
>>> +
>>> +	nvqs = VHOST_MDEV_VQ_MAX;
>>> +	m->nvqs = nvqs;
>>> +
>>> +	m->vqs = devm_kmalloc_array(dev, nvqs, sizeof(struct vhost_virtqueue),
>>> +				    GFP_KERNEL);
>>> +	if (!m->vqs)
>>> +		return -ENOMEM;
>>
>> Is it better to move those allocation to open? Otherwise the memset there
>> seems strange.
> OK.
>
>>
>>> +
>>> +	r = vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, m);
>>> +	if (r)
>>> +		return r;
>>> +
>>> +	mutex_init(&m->mutex);
>>> +	m->features = ops->get_features(mdev);
>>> +	m->mdev = mdev;
>>> +	return 0;
>>> +}
>>> +
>>> +static void vhost_mdev_remove(struct device *dev)
>>> +{
>>> +	struct vhost_mdev *m;
>>> +
>>> +	m = vfio_del_group_dev(dev);
>>> +	mutex_destroy(&m->mutex);
>>> +}
>>> +
>>> +static const struct mdev_class_id vhost_mdev_match[] = {
>>> +	{ MDEV_CLASS_ID_VHOST },
>>> +	{ 0 },
>>> +};
>>> +MODULE_DEVICE_TABLE(mdev, vhost_mdev_match);
>>> +
>>> +static struct mdev_driver vhost_mdev_driver = {
>>> +	.name	= "vhost_mdev",
>>> +	.probe	= vhost_mdev_probe,
>>> +	.remove	= vhost_mdev_remove,
>>> +	.id_table = vhost_mdev_match,
>>> +};
>>> +
>>> +static int __init vhost_mdev_init(void)
>>> +{
>>> +	return mdev_register_driver(&vhost_mdev_driver, THIS_MODULE);
>>> +}
>>> +module_init(vhost_mdev_init);
>>> +
>>> +static void __exit vhost_mdev_exit(void)
>>> +{
>>> +	mdev_unregister_driver(&vhost_mdev_driver);
>>> +}
>>> +module_exit(vhost_mdev_exit);
>>> +
>>> +MODULE_VERSION("0.0.1");
>>> +MODULE_LICENSE("GPL v2");
>>> +MODULE_DESCRIPTION("Mediated device based accelerator for virtio");
>>> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
>>> index 13e045e09d3b..6060cdbe6d3e 100644
>>> --- a/include/linux/mdev.h
>>> +++ b/include/linux/mdev.h
>>> @@ -114,6 +114,8 @@ void mdev_set_vfio_ops(struct mdev_device *mdev,
>>>    		       const struct vfio_mdev_device_ops *vfio_ops);
>>>    void mdev_set_virtio_ops(struct mdev_device *mdev,
>>>                             const struct virtio_mdev_device_ops *virtio_ops);
>>> +void mdev_set_vhost_ops(struct mdev_device *mdev,
>>> +			const struct virtio_mdev_device_ops *vhost_ops);
>>>    const void *mdev_get_dev_ops(struct mdev_device *mdev);
>>>    extern struct bus_type mdev_bus_type;
>>> @@ -131,6 +133,7 @@ struct mdev_device *mdev_from_dev(struct device *dev);
>>>    enum {
>>>    	MDEV_CLASS_ID_VFIO = 1,
>>>    	MDEV_CLASS_ID_VIRTIO = 2,
>>> +	MDEV_CLASS_ID_VHOST = 3,
>>>    	/* New entries must be added here */
>>>    };
>>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>>> index 40d028eed645..dad3c62bd91b 100644
>>> --- a/include/uapi/linux/vhost.h
>>> +++ b/include/uapi/linux/vhost.h
>>> @@ -116,4 +116,17 @@
>>>    #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
>>>    #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
>>> +/* VHOST_MDEV specific defines */
>>> +
>>> +/* Get and set the status of the backend. The status bits follow the
>>> + * same definition of the device status defined in virtio-spec. */
>>> +#define VHOST_MDEV_GET_STATUS		_IOW(VHOST_VIRTIO, 0x70, __u8)
>>> +#define VHOST_MDEV_SET_STATUS		_IOW(VHOST_VIRTIO, 0x71, __u8)
>>> +/* Enable/disable the ring. */
>>> +#define VHOST_MDEV_SET_VRING_ENABLE	_IOW(VHOST_VIRTIO, 0x72, struct vhost_vring_state)
>>> +/* Get the max ring size. */
>>> +#define VHOST_MDEV_GET_VRING_NUM	_IOW(VHOST_VIRTIO, 0x73, __u16)
>>> +/* Get the max number of queues. */
>>> +#define VHOST_MDEV_GET_QUEUE_NUM	_IOW(VHOST_VIRTIO, 0x74, __u32)
>>
>> Do we need API for userspace to get backend capability? (that calls
>> get_mdev_device_features())
> Vhost already has the features and backend features ioctls.


Good point, we probably need a new bit for _F_LOG_ALL. But the question 
still stands: how does vhost-mdev know whether the parent supports dirty 
page tracking?
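
For illustration, one hypothetical direction (nothing below exists in the 
current series): let the parent advertise a dedicated mdev-level 
capability bit through get_mdev_features(), and let vhost-mdev translate 
it into VHOST_F_LOG_ALL when reporting features to userspace. The 
VIRTIO_MDEV_F_LOG_ALL name and value are invented for the sketch:

/* Hypothetical capability bit, value made up for the sketch. */
#define VIRTIO_MDEV_F_LOG_ALL	(1ULL << 1)

static u64 vhost_mdev_features(struct vhost_mdev *m)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	u64 features = ops->get_features(m->mdev);

	/* Only expose _F_LOG_ALL if the parent can track dirty pages. */
	if (!(ops->get_mdev_features(m->mdev) & VIRTIO_MDEV_F_LOG_ALL))
		features &= ~(1ULL << VHOST_F_LOG_ALL);

	return features;
}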

Thanks


> In vhost-mdev, it might be better to still just use them to
> expose the backend capability to userspace.
>
> Thanks for the review!
> Tiwei
>
>> Thanks
>>
>>
>>> +
>>>    #endif
Tiwei Bie Oct. 23, 2019, 7:07 a.m. UTC | #4
On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
> On 2019/10/23 上午11:02, Tiwei Bie wrote:
> > On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> > > On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > > > This patch introduces a mdev based hardware vhost backend.
> > > > This backend is built on top of the same abstraction used
> > > > in virtio-mdev and provides a generic vhost interface for
> > > > userspace to accelerate the virtio devices in guest.
> > > > 
> > > > This backend is implemented as a mdev device driver on top
> > > > of the same mdev device ops used in virtio-mdev but using
> > > > a different mdev class id, and it will register the device
> > > > as a VFIO device for userspace to use. Userspace can setup
> > > > the IOMMU with the existing VFIO container/group APIs and
> > > > then get the device fd with the device name. After getting
> > > > the device fd of this device, userspace can use vhost ioctls
> > > > to setup the backend.
> > > > 
> > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > ---
> > > > This patch depends on below series:
> > > > https://lkml.org/lkml/2019/10/17/286
> > > > 
> > > > v1 -> v2:
> > > > - Replace _SET_STATE with _SET_STATUS (MST);
> > > > - Check status bits at each step (MST);
> > > > - Report the max ring size and max number of queues (MST);
> > > > - Add missing MODULE_DEVICE_TABLE (Jason);
> > > > - Only support the network backend w/o multiqueue for now;
> > > 
> > > Any idea on how to extend it to support devices other than net? I think we
> > > want a generic API or an API that could be made generic in the future.
> > > 
> > > Do we want to e.g having a generic vhost mdev for all kinds of devices or
> > > introducing e.g vhost-net-mdev and vhost-scsi-mdev?
> > One possible way is to do what vhost-user does. I.e. Apart from
> > the generic ring, features, ... related ioctls, we also introduce
> > device specific ioctls when we need them. As vhost-mdev just needs
> > to forward configs between parent and userspace and even won't
> > cache any info when possible,
> 
> 
> So it looks to me this is only possible if we expose e.g set_config and
> get_config to userspace.

The set_config and get_config interfaces don't really cover all of the
device specific settings. We also have the ctrlq in virtio-net.

> 
> 
> > I think it might be better to do
> > this in one generic vhost-mdev module.
> 
> 
> Looking at definitions of VhostUserRequest in qemu, it mixed generic API
> with device specific API. If we want go this ways (a generic vhost-mdev),
> more questions needs to be answered:
> 
> 1) How could userspace know which type of vhost it would use? Do we need to
> expose virtio subsystem device in for userspace this case?
> 
> 2) That generic vhost-mdev module still need to filter out unsupported
> ioctls for a specific type. E.g if it probes a net device, it should refuse
> API for other type. This in fact a vhost-mdev-net but just not modularize it
> on top of vhost-mdev.
> 
> 
> > 
> > > 
> > > > - Some minor fixes and improvements;
> > > > - Rebase on top of virtio-mdev series v4;
[...]
> > > > +
> > > > +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
> > > > +{
> > > > +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> > > > +		return -EFAULT;
> > > 
> > > As discussed in previous version do we need to filter out MQ feature here?
> > I think it's more straightforward to let the parent drivers to
> > filter out the unsupported features. Otherwise it would be tricky
> > when we want to add more features in vhost-mdev module,
> 
> 
> It's as simple as remove the feature from blacklist?

It's not really that easy. It may break the old drivers.

> 
> 
> > i.e. if
> > the parent drivers may expose unsupported features and relay on
> > vhost-mdev to filter them out, these features will be exposed
> > to userspace automatically when they are enabled in vhost-mdev
> > in the future.
> 
> 
> The issue is, it's only that vhost-mdev knows its own limitation. E.g in
> this patch, vhost-mdev only implements a subset of transport API, but parent
> doesn't know about that.
> 
> Still MQ as an example, there's no way (or no need) for parent to know that
> vhost-mdev does not support MQ.

The mdev is a MDEV_CLASS_ID_VHOST mdev device. When the parent
is being developed, it should know the currently supported features
of vhost-mdev.

> And this allows old kenrel to work with new
> parent drivers.

The new drivers should provide things like VIRTIO_MDEV_F_VERSION_1
to be compatible with the old kernels. When VIRTIO_MDEV_F_VERSION_1
is provided/negotiated, the behaviours should be consistent.

> 
> So basically we have three choices here:
> 
> 1) Implement what vhost-user did and implement a generic vhost-mdev (but may
> still have lots of device specific code). To support advanced feature which
> requires the access to config, still lots of API that needs to be added.
> 
> 2) Implement what vhost-kernel did, have a generic vhost-mdev driver and a
> vhost bus on top for match a device specific API e.g vhost-mdev-net. We
> still have device specific API but limit them only to device specific
> module. Still require new ioctls for advanced feature like MQ.
> 
> 3) Simply expose all virtio-mdev transport to userspace.

Currently, the virtio-mdev transport is a set of function callbacks
defined in the kernel. How can we simply expose the virtio-mdev transport
to userspace?


> A generic module
> without any type specific code (like virtio-mdev). No need dedicated API for
> e.g MQ. But then the API will look much different than current vhost did.
> 
> Consider the limitation of 1) I tend to choose 2 or 3. What's you opinion?
> 
>
Jason Wang Oct. 23, 2019, 7:25 a.m. UTC | #5
On 2019/10/23 下午3:07, Tiwei Bie wrote:
> On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
>> On 2019/10/23 上午11:02, Tiwei Bie wrote:
>>> On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
>>>> On 2019/10/22 下午5:52, Tiwei Bie wrote:
>>>>> This patch introduces a mdev based hardware vhost backend.
>>>>> This backend is built on top of the same abstraction used
>>>>> in virtio-mdev and provides a generic vhost interface for
>>>>> userspace to accelerate the virtio devices in guest.
>>>>>
>>>>> This backend is implemented as a mdev device driver on top
>>>>> of the same mdev device ops used in virtio-mdev but using
>>>>> a different mdev class id, and it will register the device
>>>>> as a VFIO device for userspace to use. Userspace can setup
>>>>> the IOMMU with the existing VFIO container/group APIs and
>>>>> then get the device fd with the device name. After getting
>>>>> the device fd of this device, userspace can use vhost ioctls
>>>>> to setup the backend.
>>>>>
>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>>>> ---
>>>>> This patch depends on below series:
>>>>> https://lkml.org/lkml/2019/10/17/286
>>>>>
>>>>> v1 -> v2:
>>>>> - Replace _SET_STATE with _SET_STATUS (MST);
>>>>> - Check status bits at each step (MST);
>>>>> - Report the max ring size and max number of queues (MST);
>>>>> - Add missing MODULE_DEVICE_TABLE (Jason);
>>>>> - Only support the network backend w/o multiqueue for now;
>>>> Any idea on how to extend it to support devices other than net? I think we
>>>> want a generic API or an API that could be made generic in the future.
>>>>
>>>> Do we want to e.g having a generic vhost mdev for all kinds of devices or
>>>> introducing e.g vhost-net-mdev and vhost-scsi-mdev?
>>> One possible way is to do what vhost-user does. I.e. Apart from
>>> the generic ring, features, ... related ioctls, we also introduce
>>> device specific ioctls when we need them. As vhost-mdev just needs
>>> to forward configs between parent and userspace and even won't
>>> cache any info when possible,
>>
>> So it looks to me this is only possible if we expose e.g set_config and
>> get_config to userspace.
> The set_config and get_config interface isn't really everything
> of device specific settings. We also have ctrlq in virtio-net.


Yes, but it could be processed by the existing API, couldn't it? Just set 
the ctrl vq address and let the parent deal with that.


>
>>
>>> I think it might be better to do
>>> this in one generic vhost-mdev module.
>>
>> Looking at definitions of VhostUserRequest in qemu, it mixed generic API
>> with device specific API. If we want go this ways (a generic vhost-mdev),
>> more questions needs to be answered:
>>
>> 1) How could userspace know which type of vhost it would use? Do we need to
>> expose virtio subsystem device in for userspace this case?
>>
>> 2) That generic vhost-mdev module still need to filter out unsupported
>> ioctls for a specific type. E.g if it probes a net device, it should refuse
>> API for other type. This in fact a vhost-mdev-net but just not modularize it
>> on top of vhost-mdev.
>>
>>
>>>>> - Some minor fixes and improvements;
>>>>> - Rebase on top of virtio-mdev series v4;
> [...]
>>>>> +
>>>>> +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
>>>>> +{
>>>>> +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
>>>>> +		return -EFAULT;
>>>> As discussed in previous version do we need to filter out MQ feature here?
>>> I think it's more straightforward to let the parent drivers to
>>> filter out the unsupported features. Otherwise it would be tricky
>>> when we want to add more features in vhost-mdev module,
>>
>> It's as simple as remove the feature from blacklist?
> It's not really that easy. It may break the old drivers.


I'm not sure I understand here; we do feature negotiation anyhow. By old 
drivers, do you mean the guest drivers without MQ?


>
>>
>>> i.e. if
>>> the parent drivers may expose unsupported features and relay on
>>> vhost-mdev to filter them out, these features will be exposed
>>> to userspace automatically when they are enabled in vhost-mdev
>>> in the future.
>>
>> The issue is, it's only that vhost-mdev knows its own limitation. E.g in
>> this patch, vhost-mdev only implements a subset of transport API, but parent
>> doesn't know about that.
>>
>> Still MQ as an example, there's no way (or no need) for parent to know that
>> vhost-mdev does not support MQ.
> The mdev is a MDEV_CLASS_ID_VHOST mdev device. When the parent
> is being developed, it should know the currently supported features
> of vhost-mdev.


How can the parent know that MQ is not supported by vhost-mdev?


>
>> And this allows old kenrel to work with new
>> parent drivers.
> The new drivers should provide things like VIRTIO_MDEV_F_VERSION_1
> to be compatible with the old kernels. When VIRTIO_MDEV_F_VERSION_1
> is provided/negotiated, the behaviours should be consistent.


To be clear, I didn't mean a change in the virtio-mdev API; I meant:

1) old vhost-mdev kernel driver that filters out MQ

2) new parent driver that supports MQ


>
>> So basically we have three choices here:
>>
>> 1) Implement what vhost-user did and implement a generic vhost-mdev (but may
>> still have lots of device specific code). To support advanced feature which
>> requires the access to config, still lots of API that needs to be added.
>>
>> 2) Implement what vhost-kernel did, have a generic vhost-mdev driver and a
>> vhost bus on top for match a device specific API e.g vhost-mdev-net. We
>> still have device specific API but limit them only to device specific
>> module. Still require new ioctls for advanced feature like MQ.
>>
>> 3) Simply expose all virtio-mdev transport to userspace.
> Currently, virtio-mdev transport is a set of function callbacks
> defined in kernel. How to simply expose virtio-mdev transport to
> userspace?


The most straightforward way is to have a 1:1 mapping between ioctls and 
virtio_mdev_device_ops.
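
For one of the callbacks currently missing from that mapping, a rough 
sketch could look like the helper below; the VHOST_MDEV_GET_DEVICE_ID 
ioctl name is hypothetical, while get_device_id() is the existing 
virtio-mdev callback. get_config/set_config would map the same way, just 
with a small struct carrying an offset, a length and the config bytes:

static long vhost_mdev_get_device_id(struct vhost_mdev *m, u32 __user *idp)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	u32 id = ops->get_device_id(m->mdev);

	if (put_user(id, idp))
		return -EFAULT;
	return 0;
}

/* ... and a matching case in vhost_mdev_unlocked_ioctl():
 *	case VHOST_MDEV_GET_DEVICE_ID:
 *		r = vhost_mdev_get_device_id(m, argp);
 *		break;
 */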

Thanks


>
>
>> A generic module
>> without any type specific code (like virtio-mdev). No need dedicated API for
>> e.g MQ. But then the API will look much different than current vhost did.
>>
>> Consider the limitation of 1) I tend to choose 2 or 3. What's you opinion?
>>
>>
Tiwei Bie Oct. 23, 2019, 10:11 a.m. UTC | #6
On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
> On 2019/10/23 下午3:07, Tiwei Bie wrote:
> > On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
> > > On 2019/10/23 上午11:02, Tiwei Bie wrote:
> > > > On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> > > > > On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > > > > > This patch introduces a mdev based hardware vhost backend.
> > > > > > This backend is built on top of the same abstraction used
> > > > > > in virtio-mdev and provides a generic vhost interface for
> > > > > > userspace to accelerate the virtio devices in guest.
> > > > > > 
> > > > > > This backend is implemented as a mdev device driver on top
> > > > > > of the same mdev device ops used in virtio-mdev but using
> > > > > > a different mdev class id, and it will register the device
> > > > > > as a VFIO device for userspace to use. Userspace can setup
> > > > > > the IOMMU with the existing VFIO container/group APIs and
> > > > > > then get the device fd with the device name. After getting
> > > > > > the device fd of this device, userspace can use vhost ioctls
> > > > > > to setup the backend.
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > ---
> > > > > > This patch depends on below series:
> > > > > > https://lkml.org/lkml/2019/10/17/286
> > > > > > 
> > > > > > v1 -> v2:
> > > > > > - Replace _SET_STATE with _SET_STATUS (MST);
> > > > > > - Check status bits at each step (MST);
> > > > > > - Report the max ring size and max number of queues (MST);
> > > > > > - Add missing MODULE_DEVICE_TABLE (Jason);
> > > > > > - Only support the network backend w/o multiqueue for now;
> > > > > Any idea on how to extend it to support devices other than net? I think we
> > > > > want a generic API or an API that could be made generic in the future.
> > > > > 
> > > > > Do we want to e.g having a generic vhost mdev for all kinds of devices or
> > > > > introducing e.g vhost-net-mdev and vhost-scsi-mdev?
> > > > One possible way is to do what vhost-user does. I.e. Apart from
> > > > the generic ring, features, ... related ioctls, we also introduce
> > > > device specific ioctls when we need them. As vhost-mdev just needs
> > > > to forward configs between parent and userspace and even won't
> > > > cache any info when possible,
> > > 
> > > So it looks to me this is only possible if we expose e.g set_config and
> > > get_config to userspace.
> > The set_config and get_config interface isn't really everything
> > of device specific settings. We also have ctrlq in virtio-net.
> 
> 
> Yes, but it could be processed by the exist API. Isn't it? Just set ctrl vq
> address and let parent to deal with that.

I mean, how do we expose the ctrlq related settings to userspace?

> 
> 
> > 
> > > 
> > > > I think it might be better to do
> > > > this in one generic vhost-mdev module.
> > > 
> > > Looking at definitions of VhostUserRequest in qemu, it mixed generic API
> > > with device specific API. If we want go this ways (a generic vhost-mdev),
> > > more questions needs to be answered:
> > > 
> > > 1) How could userspace know which type of vhost it would use? Do we need to
> > > expose virtio subsystem device in for userspace this case?
> > > 
> > > 2) That generic vhost-mdev module still need to filter out unsupported
> > > ioctls for a specific type. E.g if it probes a net device, it should refuse
> > > API for other type. This in fact a vhost-mdev-net but just not modularize it
> > > on top of vhost-mdev.
> > > 
> > > 
> > > > > > - Some minor fixes and improvements;
> > > > > > - Rebase on top of virtio-mdev series v4;
> > [...]
> > > > > > +
> > > > > > +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
> > > > > > +{
> > > > > > +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> > > > > > +		return -EFAULT;
> > > > > As discussed in previous version do we need to filter out MQ feature here?
> > > > I think it's more straightforward to let the parent drivers to
> > > > filter out the unsupported features. Otherwise it would be tricky
> > > > when we want to add more features in vhost-mdev module,
> > > 
> > > It's as simple as remove the feature from blacklist?
> > It's not really that easy. It may break the old drivers.
> 
> 
> I'm not sure I understand here, we do feature negotiation anyhow. For old
> drivers do you mean the guest drivers without MQ?

For old drivers I mean old parent drivers. It's possible
to compile old drivers on new kernels.

I'm not quite sure how we will implement MQ support in
vhost-mdev. If we need to introduce new virtio_mdev_device_ops
callbacks and an old driver exposes the MQ feature,
then the new vhost-mdev will see this old driver expose the
MQ feature but not provide the corresponding callbacks.

> 
> 
> > 
> > > 
> > > > i.e. if
> > > > the parent drivers may expose unsupported features and relay on
> > > > vhost-mdev to filter them out, these features will be exposed
> > > > to userspace automatically when they are enabled in vhost-mdev
> > > > in the future.
> > > 
> > > The issue is, it's only that vhost-mdev knows its own limitation. E.g in
> > > this patch, vhost-mdev only implements a subset of transport API, but parent
> > > doesn't know about that.
> > > 
> > > Still MQ as an example, there's no way (or no need) for parent to know that
> > > vhost-mdev does not support MQ.
> > The mdev is a MDEV_CLASS_ID_VHOST mdev device. When the parent
> > is being developed, it should know the currently supported features
> > of vhost-mdev.
> 
> 
> How can parent know MQ is not supported by vhost-mdev?

Good point. I agree vhost-mdev should filter out the unsupported
features. But in the meantime, I think drivers also shouldn't
expose unsupported features.
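
A minimal sketch of such filtering on the vhost-mdev side (not part of 
this patch; only MQ is masked here, and the real list would need more 
thought):

#include <linux/bits.h>
#include <linux/virtio_net.h>

static u64 vhost_mdev_filter_features(u64 features)
{
	/* vhost-mdev has no transport for MQ setup yet. */
	return features & ~BIT_ULL(VIRTIO_NET_F_MQ);
}

/* e.g. in vhost_mdev_probe():
 *	m->features = vhost_mdev_filter_features(ops->get_features(mdev));
 */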

> 
> 
> > 
> > > And this allows old kenrel to work with new
> > > parent drivers.
> > The new drivers should provide things like VIRTIO_MDEV_F_VERSION_1
> > to be compatible with the old kernels. When VIRTIO_MDEV_F_VERSION_1
> > is provided/negotiated, the behaviours should be consistent.
> 
> 
> To be clear, I didn't mean a change in virtio-mdev API, I meant:
> 
> 1) old vhost-mdev kernel driver that filters out MQ
> 
> 2) new parent driver that support MQ
> 
> 
> > 
> > > So basically we have three choices here:
> > > 
> > > 1) Implement what vhost-user did and implement a generic vhost-mdev (but may
> > > still have lots of device specific code). To support advanced feature which
> > > requires the access to config, still lots of API that needs to be added.
> > > 
> > > 2) Implement what vhost-kernel did, have a generic vhost-mdev driver and a
> > > vhost bus on top for match a device specific API e.g vhost-mdev-net. We
> > > still have device specific API but limit them only to device specific
> > > module. Still require new ioctls for advanced feature like MQ.
> > > 
> > > 3) Simply expose all virtio-mdev transport to userspace.
> > Currently, virtio-mdev transport is a set of function callbacks
> > defined in kernel. How to simply expose virtio-mdev transport to
> > userspace?
> 
> 
> The most straightforward way is to have an 1:1 mapping between ioctl and
> virito_mdev_device_ops.

It seems we are already trying to do a 1:1 mapping between ioctls
and virtio_mdev_device_ops in vhost-mdev now (the major piece
missing is get_device_id/get_config/set_config).


> 
> Thanks
> 
> 
> > 
> > 
> > > A generic module
> > > without any type specific code (like virtio-mdev). No need dedicated API for
> > > e.g MQ. But then the API will look much different than current vhost did.
> > > 
> > > Consider the limitation of 1) I tend to choose 2 or 3. What's you opinion?
> > > 
> > > 
>
Jason Wang Oct. 23, 2019, 10:29 a.m. UTC | #7
On 2019/10/23 下午6:11, Tiwei Bie wrote:
> On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
>> On 2019/10/23 下午3:07, Tiwei Bie wrote:
>>> On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
>>>> On 2019/10/23 上午11:02, Tiwei Bie wrote:
>>>>> On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
>>>>>> On 2019/10/22 下午5:52, Tiwei Bie wrote:
>>>>>>> This patch introduces a mdev based hardware vhost backend.
>>>>>>> This backend is built on top of the same abstraction used
>>>>>>> in virtio-mdev and provides a generic vhost interface for
>>>>>>> userspace to accelerate the virtio devices in guest.
>>>>>>>
>>>>>>> This backend is implemented as a mdev device driver on top
>>>>>>> of the same mdev device ops used in virtio-mdev but using
>>>>>>> a different mdev class id, and it will register the device
>>>>>>> as a VFIO device for userspace to use. Userspace can setup
>>>>>>> the IOMMU with the existing VFIO container/group APIs and
>>>>>>> then get the device fd with the device name. After getting
>>>>>>> the device fd of this device, userspace can use vhost ioctls
>>>>>>> to setup the backend.
>>>>>>>
>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>>>>>> ---
>>>>>>> This patch depends on below series:
>>>>>>> https://lkml.org/lkml/2019/10/17/286
>>>>>>>
>>>>>>> v1 -> v2:
>>>>>>> - Replace _SET_STATE with _SET_STATUS (MST);
>>>>>>> - Check status bits at each step (MST);
>>>>>>> - Report the max ring size and max number of queues (MST);
>>>>>>> - Add missing MODULE_DEVICE_TABLE (Jason);
>>>>>>> - Only support the network backend w/o multiqueue for now;
>>>>>> Any idea on how to extend it to support devices other than net? I think we
>>>>>> want a generic API or an API that could be made generic in the future.
>>>>>>
>>>>>> Do we want to e.g having a generic vhost mdev for all kinds of devices or
>>>>>> introducing e.g vhost-net-mdev and vhost-scsi-mdev?
>>>>> One possible way is to do what vhost-user does. I.e. Apart from
>>>>> the generic ring, features, ... related ioctls, we also introduce
>>>>> device specific ioctls when we need them. As vhost-mdev just needs
>>>>> to forward configs between parent and userspace and even won't
>>>>> cache any info when possible,
>>>> So it looks to me this is only possible if we expose e.g set_config and
>>>> get_config to userspace.
>>> The set_config and get_config interface isn't really everything
>>> of device specific settings. We also have ctrlq in virtio-net.
>>
>> Yes, but it could be processed by the exist API. Isn't it? Just set ctrl vq
>> address and let parent to deal with that.
> I mean how to expose ctrlq related settings to userspace?


I think it works like:

1) userspace finds that ctrl_vq is supported

2) then it can allocate memory for ctrl vq and set its address through 
vhost-mdev

3) userspace can populate ctrl vq itself (rough sketch below)
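
A rough userspace sketch of steps 2) and 3), with error handling trimmed; 
it assumes the usual vhost setup (VHOST_SET_OWNER, feature negotiation) 
has already been done on the vhost-mdev fd, and the ctrl vq index and 
ring size below are made up. Whether the ring addresses here end up being 
HVAs or IOVAs is exactly the open question discussed below:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

#define CTRL_VQ_INDEX	2	/* assumed: single queue pair + ctrl vq */
#define CTRL_VQ_SIZE	64

static int setup_ctrl_vq(int vhost_mdev_fd)
{
	struct vhost_vring_state num = { .index = CTRL_VQ_INDEX, .num = CTRL_VQ_SIZE };
	struct vhost_vring_addr addr = { .index = CTRL_VQ_INDEX };
	void *desc, *avail, *used;

	/* 2) allocate the ring and program its address through vhost-mdev */
	if (posix_memalign(&desc, 4096, 16 * CTRL_VQ_SIZE) ||
	    posix_memalign(&avail, 4096, 6 + 2 * CTRL_VQ_SIZE) ||
	    posix_memalign(&used, 4096, 6 + 8 * CTRL_VQ_SIZE))
		return -1;

	addr.desc_user_addr = (uintptr_t)desc;
	addr.avail_user_addr = (uintptr_t)avail;
	addr.used_user_addr = (uintptr_t)used;

	if (ioctl(vhost_mdev_fd, VHOST_SET_VRING_NUM, &num) ||
	    ioctl(vhost_mdev_fd, VHOST_SET_VRING_ADDR, &addr))
		return -1;

	/* 3) from here on, userspace fills descriptors and kicks the vq itself */
	return 0;
}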


>
>>
>>>>> I think it might be better to do
>>>>> this in one generic vhost-mdev module.
>>>> Looking at definitions of VhostUserRequest in qemu, it mixed generic API
>>>> with device specific API. If we want go this ways (a generic vhost-mdev),
>>>> more questions needs to be answered:
>>>>
>>>> 1) How could userspace know which type of vhost it would use? Do we need to
>>>> expose virtio subsystem device in for userspace this case?
>>>>
>>>> 2) That generic vhost-mdev module still need to filter out unsupported
>>>> ioctls for a specific type. E.g if it probes a net device, it should refuse
>>>> API for other type. This in fact a vhost-mdev-net but just not modularize it
>>>> on top of vhost-mdev.
>>>>
>>>>
>>>>>>> - Some minor fixes and improvements;
>>>>>>> - Rebase on top of virtio-mdev series v4;
>>> [...]
>>>>>>> +
>>>>>>> +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
>>>>>>> +{
>>>>>>> +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
>>>>>>> +		return -EFAULT;
>>>>>> As discussed in previous version do we need to filter out MQ feature here?
>>>>> I think it's more straightforward to let the parent drivers to
>>>>> filter out the unsupported features. Otherwise it would be tricky
>>>>> when we want to add more features in vhost-mdev module,
>>>> It's as simple as remove the feature from blacklist?
>>> It's not really that easy. It may break the old drivers.
>>
>> I'm not sure I understand here, we do feature negotiation anyhow. For old
>> drivers do you mean the guest drivers without MQ?
> For old drivers I mean old parent drivers. It's possible
> to compile old drivers on new kernels.


Yes, but if the old parent driver itself cannot support MQ, it should just 
not advertise that feature.


>
> I'm not quite sure how will we implement MQ support in
> vhost-mdev.


Yes, that's why I'm asking here. I think we want vhost-mdev to be generic, 
which means it's better not to let vhost-mdev know anything that is 
device specific. So this is a question that should be considered.


> If we need to introduce new virtio_mdev_device_ops
> callbacks and an old driver exposed the MQ feature,
> then the new vhost-mdev will see this old driver expose
> MQ feature but not provide corresponding callbacks.


That's exactly the issue that the current API cannot handle, so that's why I 
suggest filtering MQ out for vhost-mdev.

And in the future, we can:

1) invent new ioctls and convert them to config accesses (roughly as 
sketched below), or

2) just expose the config for userspace to access (then vhost-mdev works 
much more like virtio-mdev).
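
A purely hypothetical uapi sketch of what that config access could look 
like; neither the struct nor the ioctl numbers below are proposed ABI, 
they are just modeled on the existing VHOST_MDEV_* defines in this patch:

struct vhost_mdev_config {
	__u32 off;	/* byte offset into the device config space */
	__u32 len;	/* number of bytes in buf */
	__u8  buf[];	/* config bytes to read into / write from */
};

#define VHOST_MDEV_GET_CONFIG	_IOR(VHOST_VIRTIO, 0x75, struct vhost_mdev_config)
#define VHOST_MDEV_SET_CONFIG	_IOW(VHOST_VIRTIO, 0x76, struct vhost_mdev_config)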


>
>>
>>>>> i.e. if
>>>>> the parent drivers may expose unsupported features and relay on
>>>>> vhost-mdev to filter them out, these features will be exposed
>>>>> to userspace automatically when they are enabled in vhost-mdev
>>>>> in the future.
>>>> The issue is, it's only that vhost-mdev knows its own limitation. E.g in
>>>> this patch, vhost-mdev only implements a subset of transport API, but parent
>>>> doesn't know about that.
>>>>
>>>> Still MQ as an example, there's no way (or no need) for parent to know that
>>>> vhost-mdev does not support MQ.
>>> The mdev is a MDEV_CLASS_ID_VHOST mdev device. When the parent
>>> is being developed, it should know the currently supported features
>>> of vhost-mdev.
>>
>> How can parent know MQ is not supported by vhost-mdev?
> Good point. I agree vhost-mdev should filter out the unsupported
> features. But in the meantime, I think drivers also shouldn't
> expose unsupported features.


Exactly. But there's a case in the middle, e.g. the parent driver supports MQ 
and virtio-mdev can handle it but vhost-mdev cannot.


>
>>
>>>> And this allows old kenrel to work with new
>>>> parent drivers.
>>> The new drivers should provide things like VIRTIO_MDEV_F_VERSION_1
>>> to be compatible with the old kernels. When VIRTIO_MDEV_F_VERSION_1
>>> is provided/negotiated, the behaviours should be consistent.
>>
>> To be clear, I didn't mean a change in virtio-mdev API, I meant:
>>
>> 1) old vhost-mdev kernel driver that filters out MQ
>>
>> 2) new parent driver that support MQ
>>
>>
>>>> So basically we have three choices here:
>>>>
>>>> 1) Implement what vhost-user did and implement a generic vhost-mdev (but may
>>>> still have lots of device specific code). To support advanced feature which
>>>> requires the access to config, still lots of API that needs to be added.
>>>>
>>>> 2) Implement what vhost-kernel did, have a generic vhost-mdev driver and a
>>>> vhost bus on top for match a device specific API e.g vhost-mdev-net. We
>>>> still have device specific API but limit them only to device specific
>>>> module. Still require new ioctls for advanced feature like MQ.
>>>>
>>>> 3) Simply expose all virtio-mdev transport to userspace.
>>> Currently, virtio-mdev transport is a set of function callbacks
>>> defined in kernel. How to simply expose virtio-mdev transport to
>>> userspace?
>>
>> The most straightforward way is to have an 1:1 mapping between ioctl and
>> virito_mdev_device_ops.
> Seems we are already trying to do 1:1 mapping between ioctl
> and virtio_mdev_device_ops in vhost-mdev now (the major piece
> missing is get_device_id/get_config/set_config).


Yes, with this we can have a device independent API. Do you think this 
is better?

Thanks


>
>
>> Thanks
>>
>>
>>>
>>>> A generic module
>>>> without any type specific code (like virtio-mdev). No need dedicated API for
>>>> e.g MQ. But then the API will look much different than current vhost did.
>>>>
>>>> Consider the limitation of 1) I tend to choose 2 or 3. What's you opinion?
>>>>
>>>>
Tiwei Bie Oct. 24, 2019, 4:21 a.m. UTC | #8
On Wed, Oct 23, 2019 at 06:29:21PM +0800, Jason Wang wrote:
> On 2019/10/23 下午6:11, Tiwei Bie wrote:
> > On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
> > > On 2019/10/23 下午3:07, Tiwei Bie wrote:
> > > > On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
> > > > > On 2019/10/23 上午11:02, Tiwei Bie wrote:
> > > > > > On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> > > > > > > On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > > > > > > > This patch introduces a mdev based hardware vhost backend.
> > > > > > > > This backend is built on top of the same abstraction used
> > > > > > > > in virtio-mdev and provides a generic vhost interface for
> > > > > > > > userspace to accelerate the virtio devices in guest.
> > > > > > > > 
> > > > > > > > This backend is implemented as a mdev device driver on top
> > > > > > > > of the same mdev device ops used in virtio-mdev but using
> > > > > > > > a different mdev class id, and it will register the device
> > > > > > > > as a VFIO device for userspace to use. Userspace can setup
> > > > > > > > the IOMMU with the existing VFIO container/group APIs and
> > > > > > > > then get the device fd with the device name. After getting
> > > > > > > > the device fd of this device, userspace can use vhost ioctls
> > > > > > > > to setup the backend.
> > > > > > > > 
> > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > ---
> > > > > > > > This patch depends on below series:
> > > > > > > > https://lkml.org/lkml/2019/10/17/286
> > > > > > > > 
> > > > > > > > v1 -> v2:
> > > > > > > > - Replace _SET_STATE with _SET_STATUS (MST);
> > > > > > > > - Check status bits at each step (MST);
> > > > > > > > - Report the max ring size and max number of queues (MST);
> > > > > > > > - Add missing MODULE_DEVICE_TABLE (Jason);
> > > > > > > > - Only support the network backend w/o multiqueue for now;
> > > > > > > Any idea on how to extend it to support devices other than net? I think we
> > > > > > > want a generic API or an API that could be made generic in the future.
> > > > > > > 
> > > > > > > Do we want to e.g having a generic vhost mdev for all kinds of devices or
> > > > > > > introducing e.g vhost-net-mdev and vhost-scsi-mdev?
> > > > > > One possible way is to do what vhost-user does. I.e. Apart from
> > > > > > the generic ring, features, ... related ioctls, we also introduce
> > > > > > device specific ioctls when we need them. As vhost-mdev just needs
> > > > > > to forward configs between parent and userspace and even won't
> > > > > > cache any info when possible,
> > > > > So it looks to me this is only possible if we expose e.g set_config and
> > > > > get_config to userspace.
> > > > The set_config and get_config interface isn't really everything
> > > > of device specific settings. We also have ctrlq in virtio-net.
> > > 
> > > Yes, but it could be processed by the exist API. Isn't it? Just set ctrl vq
> > > address and let parent to deal with that.
> > I mean how to expose ctrlq related settings to userspace?
> 
> 
> I think it works like:
> 
> 1) userspace find ctrl_vq is supported
> 
> 2) then it can allocate memory for ctrl vq and set its address through
> vhost-mdev
> 
> 3) userspace can populate ctrl vq itself

I see. That is to say, userspace e.g. QEMU will program the
ctrl vq with the existing VHOST_*_VRING_* ioctls, and parent
drivers should know that the addresses used in ctrl vq are
host virtual addresses in vhost-mdev's case.

> 
> 
> > 
> > > 
> > > > > > I think it might be better to do
> > > > > > this in one generic vhost-mdev module.
> > > > > Looking at definitions of VhostUserRequest in qemu, it mixed generic API
> > > > > with device specific API. If we want go this ways (a generic vhost-mdev),
> > > > > more questions needs to be answered:
> > > > > 
> > > > > 1) How could userspace know which type of vhost it would use? Do we need to
> > > > > expose virtio subsystem device in for userspace this case?
> > > > > 
> > > > > 2) That generic vhost-mdev module still need to filter out unsupported
> > > > > ioctls for a specific type. E.g if it probes a net device, it should refuse
> > > > > API for other type. This in fact a vhost-mdev-net but just not modularize it
> > > > > on top of vhost-mdev.
> > > > > 
> > > > > 
> > > > > > > > - Some minor fixes and improvements;
> > > > > > > > - Rebase on top of virtio-mdev series v4;
> > > > [...]
> > > > > > > > +
> > > > > > > > +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
> > > > > > > > +{
> > > > > > > > +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
> > > > > > > > +		return -EFAULT;
> > > > > > > As discussed in previous version do we need to filter out MQ feature here?
> > > > > > I think it's more straightforward to let the parent drivers to
> > > > > > filter out the unsupported features. Otherwise it would be tricky
> > > > > > when we want to add more features in vhost-mdev module,
> > > > > It's as simple as remove the feature from blacklist?
> > > > It's not really that easy. It may break the old drivers.
> > > 
> > > I'm not sure I understand here, we do feature negotiation anyhow. For old
> > > drivers do you mean the guest drivers without MQ?
> > For old drivers I mean old parent drivers. It's possible
> > to compile old drivers on new kernels.
> 
> 
> Yes, but if old parent driver itself can not support MQ it should just not
> advertise that feature.
> 
> 
> > 
> > I'm not quite sure how will we implement MQ support in
> > vhost-mdev.
> 
> 
> Yes, that's why I ask here. I think we want the vhost-mdev to be generic
> which means it's better not let vhost-mdev to know anything which is device
> specific. So this is a question that should be considered.

+1

> 
> 
> > If we need to introduce new virtio_mdev_device_ops
> > callbacks and an old driver exposed the MQ feature,
> > then the new vhost-mdev will see this old driver expose
> > MQ feature but not provide corresponding callbacks.ean
> 
> 
> That's exact the issue which current API can not handle, so that's why I
> suggest to filter MQ out for vhost-mdev.
> 
> And in the future, we can:
> 
> 1) invent new ioctls and convert them to config access or
> 
> 2) just exposing config for userspace to access (then vhost-mdev work much
> more similar to virtio-mdev).
> 
> 
> > 
> > > 
> > > > > > i.e. if
> > > > > > the parent drivers may expose unsupported features and relay on
> > > > > > vhost-mdev to filter them out, these features will be exposed
> > > > > > to userspace automatically when they are enabled in vhost-mdev
> > > > > > in the future.
> > > > > The issue is, it's only that vhost-mdev knows its own limitation. E.g in
> > > > > this patch, vhost-mdev only implements a subset of transport API, but parent
> > > > > doesn't know about that.
> > > > > 
> > > > > Still MQ as an example, there's no way (or no need) for parent to know that
> > > > > vhost-mdev does not support MQ.
> > > > The mdev is a MDEV_CLASS_ID_VHOST mdev device. When the parent
> > > > is being developed, it should know the currently supported features
> > > > of vhost-mdev.
> > > 
> > > How can parent know MQ is not supported by vhost-mdev?
> > Good point. I agree vhost-mdev should filter out the unsupported
> > features. But in the meantime, I think drivers also shouldn't
> > expose unsupported features.
> 
> 
> Exactly. But there's a case in the middle, e.g parent drivers support MQ and
> virtio-mdev can do that but not vhost-mdev.

As we have different mdev class IDs between virtio-mdev and
vhost-mdev, maybe parent can leverage it to return different
sets of supported features for virtio-mdev and vhost-mdev
respectively.
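
A rough sketch of that idea from the parent side; the "is_vhost" flag is 
parent-private state recorded when the parent registered the instance via 
mdev_set_vhost_ops() rather than mdev_set_virtio_ops() (it is not an mdev 
core API), and the MQ bit is just an example:

struct my_parent_state {
	bool is_vhost;		/* registered via mdev_set_vhost_ops()? */
	u64 hw_features;	/* everything the hardware supports */
};

static u64 my_parent_get_features(struct mdev_device *mdev)
{
	struct my_parent_state *state = mdev_get_drvdata(mdev);
	u64 features = state->hw_features;

	/* Hide MQ from vhost-mdev until it has a transport for it. */
	if (state->is_vhost)
		features &= ~BIT_ULL(VIRTIO_NET_F_MQ);

	return features;
}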

> 
> 
> > 
> > > 
> > > > > And this allows old kenrel to work with new
> > > > > parent drivers.
> > > > The new drivers should provide things like VIRTIO_MDEV_F_VERSION_1
> > > > to be compatible with the old kernels. When VIRTIO_MDEV_F_VERSION_1
> > > > is provided/negotiated, the behaviours should be consistent.
> > > 
> > > To be clear, I didn't mean a change in virtio-mdev API, I meant:
> > > 
> > > 1) old vhost-mdev kernel driver that filters out MQ
> > > 
> > > 2) new parent driver that support MQ
> > > 
> > > 
> > > > > So basically we have three choices here:
> > > > > 
> > > > > 1) Implement what vhost-user did and implement a generic vhost-mdev (but may
> > > > > still have lots of device specific code). To support advanced feature which
> > > > > requires the access to config, still lots of API that needs to be added.
> > > > > 
> > > > > 2) Implement what vhost-kernel did, have a generic vhost-mdev driver and a
> > > > > vhost bus on top for match a device specific API e.g vhost-mdev-net. We
> > > > > still have device specific API but limit them only to device specific
> > > > > module. Still require new ioctls for advanced feature like MQ.
> > > > > 
> > > > > 3) Simply expose all virtio-mdev transport to userspace.
> > > > Currently, virtio-mdev transport is a set of function callbacks
> > > > defined in kernel. How to simply expose virtio-mdev transport to
> > > > userspace?
> > > 
> > > The most straightforward way is to have an 1:1 mapping between ioctl and
> > > virito_mdev_device_ops.
> > Seems we are already trying to do 1:1 mapping between ioctl
> > and virtio_mdev_device_ops in vhost-mdev now (the major piece
> > missing is get_device_id/get_config/set_config).
> 
> 
> Yes, with this we can have a device independent API. Do you think this is
> better?

Yeah, I agree.

Thanks,
Tiwei

> 
> Thanks
> 
> 
> > 
> > 
> > > Thanks
> > > 
> > > 
> > > > 
> > > > > A generic module
> > > > > without any type specific code (like virtio-mdev). No need dedicated API for
> > > > > e.g MQ. But then the API will look much different than current vhost did.
> > > > > 
> > > > > Consider the limitation of 1) I tend to choose 2 or 3. What's you opinion?
> > > > > 
> > > > > 
>
Jason Wang Oct. 24, 2019, 8:03 a.m. UTC | #9
On 2019/10/24 下午12:21, Tiwei Bie wrote:
> On Wed, Oct 23, 2019 at 06:29:21PM +0800, Jason Wang wrote:
>> On 2019/10/23 下午6:11, Tiwei Bie wrote:
>>> On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
>>>> On 2019/10/23 下午3:07, Tiwei Bie wrote:
>>>>> On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
>>>>>> On 2019/10/23 上午11:02, Tiwei Bie wrote:
>>>>>>> On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
>>>>>>>> On 2019/10/22 下午5:52, Tiwei Bie wrote:
>>>>>>>>> This patch introduces a mdev based hardware vhost backend.
>>>>>>>>> This backend is built on top of the same abstraction used
>>>>>>>>> in virtio-mdev and provides a generic vhost interface for
>>>>>>>>> userspace to accelerate the virtio devices in guest.
>>>>>>>>>
>>>>>>>>> This backend is implemented as a mdev device driver on top
>>>>>>>>> of the same mdev device ops used in virtio-mdev but using
>>>>>>>>> a different mdev class id, and it will register the device
>>>>>>>>> as a VFIO device for userspace to use. Userspace can setup
>>>>>>>>> the IOMMU with the existing VFIO container/group APIs and
>>>>>>>>> then get the device fd with the device name. After getting
>>>>>>>>> the device fd of this device, userspace can use vhost ioctls
>>>>>>>>> to setup the backend.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>>>>>>>> ---
>>>>>>>>> This patch depends on below series:
>>>>>>>>> https://lkml.org/lkml/2019/10/17/286
>>>>>>>>>
>>>>>>>>> v1 -> v2:
>>>>>>>>> - Replace _SET_STATE with _SET_STATUS (MST);
>>>>>>>>> - Check status bits at each step (MST);
>>>>>>>>> - Report the max ring size and max number of queues (MST);
>>>>>>>>> - Add missing MODULE_DEVICE_TABLE (Jason);
>>>>>>>>> - Only support the network backend w/o multiqueue for now;
>>>>>>>> Any idea on how to extend it to support devices other than net? I think we
>>>>>>>> want a generic API or an API that could be made generic in the future.
>>>>>>>>
>>>>>>>> Do we want to e.g having a generic vhost mdev for all kinds of devices or
>>>>>>>> introducing e.g vhost-net-mdev and vhost-scsi-mdev?
>>>>>>> One possible way is to do what vhost-user does. I.e. Apart from
>>>>>>> the generic ring, features, ... related ioctls, we also introduce
>>>>>>> device specific ioctls when we need them. As vhost-mdev just needs
>>>>>>> to forward configs between parent and userspace and even won't
>>>>>>> cache any info when possible,
>>>>>> So it looks to me this is only possible if we expose e.g set_config and
>>>>>> get_config to userspace.
>>>>> The set_config and get_config interface isn't really everything
>>>>> of device specific settings. We also have ctrlq in virtio-net.
>>>> Yes, but it could be processed by the exist API. Isn't it? Just set ctrl vq
>>>> address and let parent to deal with that.
>>> I mean how to expose ctrlq related settings to userspace?
>>
>> I think it works like:
>>
>> 1) userspace find ctrl_vq is supported
>>
>> 2) then it can allocate memory for ctrl vq and set its address through
>> vhost-mdev
>>
>> 3) userspace can populate ctrl vq itself
> I see. That is to say, userspace e.g. QEMU will program the
> ctrl vq with the existing VHOST_*_VRING_* ioctls, and parent
> drivers should know that the addresses used in ctrl vq are
> host virtual addresses in vhost-mdev's case.


That's a really good point. And that means the parent needs to distinguish 
vhost from virtio. It should work. But is there any chance to use DMA 
addresses? I'm asking since the API then tends to be device specific.


>
>>
>>>>>>> I think it might be better to do
>>>>>>> this in one generic vhost-mdev module.
>>>>>> Looking at definitions of VhostUserRequest in qemu, it mixed generic API
>>>>>> with device specific API. If we want go this ways (a generic vhost-mdev),
>>>>>> more questions needs to be answered:
>>>>>>
>>>>>> 1) How could userspace know which type of vhost it would use? Do we need to
>>>>>> expose virtio subsystem device in for userspace this case?
>>>>>>
>>>>>> 2) That generic vhost-mdev module still need to filter out unsupported
>>>>>> ioctls for a specific type. E.g if it probes a net device, it should refuse
>>>>>> API for other type. This in fact a vhost-mdev-net but just not modularize it
>>>>>> on top of vhost-mdev.
>>>>>>
>>>>>>
>>>>>>>>> - Some minor fixes and improvements;
>>>>>>>>> - Rebase on top of virtio-mdev series v4;
>>>>> [...]
>>>>>>>>> +
>>>>>>>>> +static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
>>>>>>>>> +{
>>>>>>>>> +	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
>>>>>>>>> +		return -EFAULT;
>>>>>>>> As discussed in previous version do we need to filter out MQ feature here?
>>>>>>> I think it's more straightforward to let the parent drivers to
>>>>>>> filter out the unsupported features. Otherwise it would be tricky
>>>>>>> when we want to add more features in vhost-mdev module,
>>>>>> It's as simple as remove the feature from blacklist?
>>>>> It's not really that easy. It may break the old drivers.
>>>> I'm not sure I understand here, we do feature negotiation anyhow. For old
>>>> drivers do you mean the guest drivers without MQ?
>>> For old drivers I mean old parent drivers. It's possible
>>> to compile old drivers on new kernels.
>>
>> Yes, but if old parent driver itself can not support MQ it should just not
>> advertise that feature.
>>
>>
>>> I'm not quite sure how will we implement MQ support in
>>> vhost-mdev.
>>
>> Yes, that's why I ask here. I think we want the vhost-mdev to be generic
>> which means it's better not let vhost-mdev to know anything which is device
>> specific. So this is a question that should be considered.
> +1
>
>>
>>> If we need to introduce new virtio_mdev_device_ops
>>> callbacks and an old driver exposed the MQ feature,
>>> then the new vhost-mdev will see this old driver expose
>>> MQ feature but not provide corresponding callbacks.ean
>>
>> That's exact the issue which current API can not handle, so that's why I
>> suggest to filter MQ out for vhost-mdev.
>>
>> And in the future, we can:
>>
>> 1) invent new ioctls and convert them to config access or
>>
>> 2) just exposing config for userspace to access (then vhost-mdev work much
>> more similar to virtio-mdev).
>>
>>
>>>>>>> i.e. if
>>>>>>> the parent drivers may expose unsupported features and relay on
>>>>>>> vhost-mdev to filter them out, these features will be exposed
>>>>>>> to userspace automatically when they are enabled in vhost-mdev
>>>>>>> in the future.
>>>>>> The issue is, it's only that vhost-mdev knows its own limitation. E.g in
>>>>>> this patch, vhost-mdev only implements a subset of transport API, but parent
>>>>>> doesn't know about that.
>>>>>>
>>>>>> Still MQ as an example, there's no way (or no need) for parent to know that
>>>>>> vhost-mdev does not support MQ.
>>>>> The mdev is a MDEV_CLASS_ID_VHOST mdev device. When the parent
>>>>> is being developed, it should know the currently supported features
>>>>> of vhost-mdev.
>>>> How can parent know MQ is not supported by vhost-mdev?
>>> Good point. I agree vhost-mdev should filter out the unsupported
>>> features. But in the meantime, I think drivers also shouldn't
>>> expose unsupported features.
>>
>> Exactly. But there's a case in the middle, e.g parent drivers support MQ and
>> virtio-mdev can do that but not vhost-mdev.
> As we have different mdev class IDs between virtio-mdev and
> vhost-mdev, maybe parent can leverage it to return different
> sets of supported features for virtio-mdev and vhost-mdev
> respectively.


Yes, that should work.

Thanks


>
>>
>>>>>> And this allows old kenrel to work with new
>>>>>> parent drivers.
>>>>> The new drivers should provide things like VIRTIO_MDEV_F_VERSION_1
>>>>> to be compatible with the old kernels. When VIRTIO_MDEV_F_VERSION_1
>>>>> is provided/negotiated, the behaviours should be consistent.
>>>> To be clear, I didn't mean a change in virtio-mdev API, I meant:
>>>>
>>>> 1) old vhost-mdev kernel driver that filters out MQ
>>>>
>>>> 2) new parent driver that support MQ
>>>>
>>>>
>>>>>> So basically we have three choices here:
>>>>>>
>>>>>> 1) Implement what vhost-user did and implement a generic vhost-mdev (but may
>>>>>> still have lots of device specific code). To support advanced feature which
>>>>>> requires the access to config, still lots of API that needs to be added.
>>>>>>
>>>>>> 2) Implement what vhost-kernel did, have a generic vhost-mdev driver and a
>>>>>> vhost bus on top for match a device specific API e.g vhost-mdev-net. We
>>>>>> still have device specific API but limit them only to device specific
>>>>>> module. Still require new ioctls for advanced feature like MQ.
>>>>>>
>>>>>> 3) Simply expose all virtio-mdev transport to userspace.
>>>>> Currently, virtio-mdev transport is a set of function callbacks
>>>>> defined in kernel. How to simply expose virtio-mdev transport to
>>>>> userspace?
>>>> The most straightforward way is to have an 1:1 mapping between ioctl and
>>>> virito_mdev_device_ops.
>>> Seems we are already trying to do 1:1 mapping between ioctl
>>> and virtio_mdev_device_ops in vhost-mdev now (the major piece
>>> missing is get_device_id/get_config/set_config).
>>
>> Yes, with this we can have a device independent API. Do you think this is
>> better?
> Yeah, I agree.
>
> Thanks,
> Tiwei
>
>> Thanks
>>
>>
>>>
>>>> Thanks
>>>>
>>>>
>>>>>> A generic module
>>>>>> without any type specific code (like virtio-mdev). No need dedicated API for
>>>>>> e.g MQ. But then the API will look much different than current vhost did.
>>>>>>
>>>>>> Consider the limitation of 1) I tend to choose 2 or 3. What's you opinion?
>>>>>>
>>>>>>
Jason Wang Oct. 24, 2019, 8:32 a.m. UTC | #10
On 2019/10/24 下午4:03, Jason Wang wrote:
>
> On 2019/10/24 下午12:21, Tiwei Bie wrote:
>> On Wed, Oct 23, 2019 at 06:29:21PM +0800, Jason Wang wrote:
>>> On 2019/10/23 下午6:11, Tiwei Bie wrote:
>>>> On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
>>>>> On 2019/10/23 下午3:07, Tiwei Bie wrote:
>>>>>> On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
>>>>>>> On 2019/10/23 上午11:02, Tiwei Bie wrote:
>>>>>>>> On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
>>>>>>>>> On 2019/10/22 下午5:52, Tiwei Bie wrote:
>>>>>>>>>> This patch introduces a mdev based hardware vhost backend.
>>>>>>>>>> This backend is built on top of the same abstraction used
>>>>>>>>>> in virtio-mdev and provides a generic vhost interface for
>>>>>>>>>> userspace to accelerate the virtio devices in guest.
>>>>>>>>>>
>>>>>>>>>> This backend is implemented as a mdev device driver on top
>>>>>>>>>> of the same mdev device ops used in virtio-mdev but using
>>>>>>>>>> a different mdev class id, and it will register the device
>>>>>>>>>> as a VFIO device for userspace to use. Userspace can setup
>>>>>>>>>> the IOMMU with the existing VFIO container/group APIs and
>>>>>>>>>> then get the device fd with the device name. After getting
>>>>>>>>>> the device fd of this device, userspace can use vhost ioctls
>>>>>>>>>> to setup the backend.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>>>>>>>>> ---
>>>>>>>>>> This patch depends on below series:
>>>>>>>>>> https://lkml.org/lkml/2019/10/17/286
>>>>>>>>>>
>>>>>>>>>> v1 -> v2:
>>>>>>>>>> - Replace _SET_STATE with _SET_STATUS (MST);
>>>>>>>>>> - Check status bits at each step (MST);
>>>>>>>>>> - Report the max ring size and max number of queues (MST);
>>>>>>>>>> - Add missing MODULE_DEVICE_TABLE (Jason);
>>>>>>>>>> - Only support the network backend w/o multiqueue for now;
>>>>>>>>> Any idea on how to extend it to support devices other than 
>>>>>>>>> net? I think we
>>>>>>>>> want a generic API or an API that could be made generic in the 
>>>>>>>>> future.
>>>>>>>>>
>>>>>>>>> Do we want to e.g having a generic vhost mdev for all kinds of 
>>>>>>>>> devices or
>>>>>>>>> introducing e.g vhost-net-mdev and vhost-scsi-mdev?
>>>>>>>> One possible way is to do what vhost-user does. I.e. Apart from
>>>>>>>> the generic ring, features, ... related ioctls, we also introduce
>>>>>>>> device specific ioctls when we need them. As vhost-mdev just needs
>>>>>>>> to forward configs between parent and userspace and even won't
>>>>>>>> cache any info when possible,
>>>>>>> So it looks to me this is only possible if we expose e.g 
>>>>>>> set_config and
>>>>>>> get_config to userspace.
>>>>>> The set_config and get_config interface isn't really everything
>>>>>> of device specific settings. We also have ctrlq in virtio-net.
>>>>> Yes, but it could be processed by the exist API. Isn't it? Just 
>>>>> set ctrl vq
>>>>> address and let parent to deal with that.
>>>> I mean how to expose ctrlq related settings to userspace?
>>>
>>> I think it works like:
>>>
>>> 1) userspace find ctrl_vq is supported
>>>
>>> 2) then it can allocate memory for ctrl vq and set its address through
>>> vhost-mdev
>>>
>>> 3) userspace can populate ctrl vq itself
>> I see. That is to say, userspace e.g. QEMU will program the
>> ctrl vq with the existing VHOST_*_VRING_* ioctls, and parent
>> drivers should know that the addresses used in ctrl vq are
>> host virtual addresses in vhost-mdev's case.
>
>
> That's really good point. And that means parent needs to differ vhost 
> from virtio. It should work. 


HVA may only work when we have something similar to VHOST_SET_OWNER,
which can reuse the MM of its owner.


> But is there any chance to use DMA address? I'm asking since the API 
> then tends to be device specific. 


I wonder whether we can introduce a MAP IOMMU notifier and get the DMA
mappings from that.

Thanks
Tiwei Bie Oct. 24, 2019, 9:18 a.m. UTC | #11
On Thu, Oct 24, 2019 at 04:32:42PM +0800, Jason Wang wrote:
> On 2019/10/24 下午4:03, Jason Wang wrote:
> > On 2019/10/24 下午12:21, Tiwei Bie wrote:
> > > On Wed, Oct 23, 2019 at 06:29:21PM +0800, Jason Wang wrote:
> > > > On 2019/10/23 下午6:11, Tiwei Bie wrote:
> > > > > On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
> > > > > > On 2019/10/23 下午3:07, Tiwei Bie wrote:
> > > > > > > On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
> > > > > > > > On 2019/10/23 上午11:02, Tiwei Bie wrote:
> > > > > > > > > On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
> > > > > > > > > > On 2019/10/22 下午5:52, Tiwei Bie wrote:
> > > > > > > > > > > This patch introduces a mdev based hardware vhost backend.
> > > > > > > > > > > This backend is built on top of the same abstraction used
> > > > > > > > > > > in virtio-mdev and provides a generic vhost interface for
> > > > > > > > > > > userspace to accelerate the virtio devices in guest.
> > > > > > > > > > > 
> > > > > > > > > > > This backend is implemented as a mdev device driver on top
> > > > > > > > > > > of the same mdev device ops used in virtio-mdev but using
> > > > > > > > > > > a different mdev class id, and it will register the device
> > > > > > > > > > > as a VFIO device for userspace to use. Userspace can setup
> > > > > > > > > > > the IOMMU with the existing VFIO container/group APIs and
> > > > > > > > > > > then get the device fd with the device name. After getting
> > > > > > > > > > > the device fd of this device, userspace can use vhost ioctls
> > > > > > > > > > > to setup the backend.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > > This patch depends on below series:
> > > > > > > > > > > https://lkml.org/lkml/2019/10/17/286
> > > > > > > > > > > 
> > > > > > > > > > > v1 -> v2:
> > > > > > > > > > > - Replace _SET_STATE with _SET_STATUS (MST);
> > > > > > > > > > > - Check status bits at each step (MST);
> > > > > > > > > > > - Report the max ring size and max number of queues (MST);
> > > > > > > > > > > - Add missing MODULE_DEVICE_TABLE (Jason);
> > > > > > > > > > > - Only support the network backend w/o multiqueue for now;
> > > > > > > > > > Any idea on how to extend it to support
> > > > > > > > > > devices other than net? I think we
> > > > > > > > > > want a generic API or an API that could
> > > > > > > > > > be made generic in the future.
> > > > > > > > > > 
> > > > > > > > > > Do we want to e.g having a generic vhost
> > > > > > > > > > mdev for all kinds of devices or
> > > > > > > > > > introducing e.g vhost-net-mdev and vhost-scsi-mdev?
> > > > > > > > > One possible way is to do what vhost-user does. I.e. Apart from
> > > > > > > > > the generic ring, features, ... related ioctls, we also introduce
> > > > > > > > > device specific ioctls when we need them. As vhost-mdev just needs
> > > > > > > > > to forward configs between parent and userspace and even won't
> > > > > > > > > cache any info when possible,
> > > > > > > > So it looks to me this is only possible if we
> > > > > > > > expose e.g set_config and
> > > > > > > > get_config to userspace.
> > > > > > > The set_config and get_config interface isn't really everything
> > > > > > > of device specific settings. We also have ctrlq in virtio-net.
> > > > > > Yes, but it could be processed by the exist API. Isn't
> > > > > > it? Just set ctrl vq
> > > > > > address and let parent to deal with that.
> > > > > I mean how to expose ctrlq related settings to userspace?
> > > > 
> > > > I think it works like:
> > > > 
> > > > 1) userspace find ctrl_vq is supported
> > > > 
> > > > 2) then it can allocate memory for ctrl vq and set its address through
> > > > vhost-mdev
> > > > 
> > > > 3) userspace can populate ctrl vq itself
> > > I see. That is to say, userspace e.g. QEMU will program the
> > > ctrl vq with the existing VHOST_*_VRING_* ioctls, and parent
> > > drivers should know that the addresses used in ctrl vq are
> > > host virtual addresses in vhost-mdev's case.
> > 
> > 
> > That's really good point. And that means parent needs to differ vhost
> > from virtio. It should work.
> 
> 
> HVA may only work when we have something similar to VHOST_SET_OWNER which
> can reuse MM of its owner.

We already have VHOST_SET_OWNER in vhost now; the parent can handle
the commands in its .kick_vq(), which is called by the vq's .handle_kick
callback. Virtio-user did something similar:

https://github.com/DPDK/dpdk/blob/0da7f445df445630c794897347ee360d6fe6348b/drivers/net/virtio/virtio_user_ethdev.c#L313-L322
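
A minimal, hypothetical sketch (not part of this patch) of what the
parent side of that could look like: a vendor driver parsing a
control-vq kick for a vhost-mdev instance. It assumes the ctrl vq ring
addresses set via VHOST_SET_VRING_ADDR are HVAs of the vhost owner, and
that kick_vq() is reached from vhost's worker thread, which runs with
the owner's mm, so copy_from_user() can resolve them. The my_parent_*
names are made up; byte-order and used-ring handling are omitted.

#include <linux/uaccess.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>

struct my_parent {
	/* HVAs recorded when the vhost-mdev instance programmed the ctrl vq */
	struct vring_desc __user *ctrl_desc;
	struct vring_avail __user *ctrl_avail;
	u16 ctrl_last_avail;
	u16 ctrl_num;
};

/* Vendor-specific handling of a virtio-net control command (stub). */
static void my_parent_exec_ctrl_cmd(struct my_parent *p,
				    const struct virtio_net_ctrl_hdr *hdr)
{
	/* program the vendor NIC according to hdr->class / hdr->cmd */
}

static int my_parent_handle_ctrl_kick(struct my_parent *p)
{
	struct virtio_net_ctrl_hdr hdr;
	struct vring_desc desc;
	u16 avail_idx, head;

	/* Works only because we run with the vhost owner's mm. */
	if (get_user(avail_idx, &p->ctrl_avail->idx))
		return -EFAULT;

	while (p->ctrl_last_avail != avail_idx) {
		u16 slot = p->ctrl_last_avail++ % p->ctrl_num;

		if (get_user(head, &p->ctrl_avail->ring[slot]))
			return -EFAULT;
		if (copy_from_user(&desc, &p->ctrl_desc[head], sizeof(desc)))
			return -EFAULT;
		/* The first descriptor carries the command header, also an HVA. */
		if (copy_from_user(&hdr, (void __user *)(unsigned long)desc.addr,
				   sizeof(hdr)))
			return -EFAULT;
		my_parent_exec_ctrl_cmd(p, &hdr);
		/* Used-ring update and call eventfd signalling omitted. */
	}
	return 0;
}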

> 
> 
> > But is there any chance to use DMA address? I'm asking since the API
> > then tends to be device specific.
> 
> 
> I wonder whether we can introduce MAP IOMMU notifier and get DMA mappings
> from that.

I think this will complicate things unnecessarily and may
bring pain. Because, in vhost-mdev, the mdev's ctrl vq is
supposed to be managed by the host. And we should try to avoid
putting the ctrl vq and the Rx/Tx vqs in the same DMA space, to
prevent guests from having the chance to bypass the host (e.g. QEMU)
and set up the backend accelerator directly.

> 
> Thanks
>
Jason Wang Oct. 24, 2019, 10:42 a.m. UTC | #12
On 2019/10/24 下午5:18, Tiwei Bie wrote:
> On Thu, Oct 24, 2019 at 04:32:42PM +0800, Jason Wang wrote:
>> On 2019/10/24 下午4:03, Jason Wang wrote:
>>> On 2019/10/24 下午12:21, Tiwei Bie wrote:
>>>> On Wed, Oct 23, 2019 at 06:29:21PM +0800, Jason Wang wrote:
>>>>> On 2019/10/23 下午6:11, Tiwei Bie wrote:
>>>>>> On Wed, Oct 23, 2019 at 03:25:00PM +0800, Jason Wang wrote:
>>>>>>> On 2019/10/23 下午3:07, Tiwei Bie wrote:
>>>>>>>> On Wed, Oct 23, 2019 at 01:46:23PM +0800, Jason Wang wrote:
>>>>>>>>> On 2019/10/23 上午11:02, Tiwei Bie wrote:
>>>>>>>>>> On Tue, Oct 22, 2019 at 09:30:16PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2019/10/22 下午5:52, Tiwei Bie wrote:
>>>>>>>>>>>> This patch introduces a mdev based hardware vhost backend.
>>>>>>>>>>>> This backend is built on top of the same abstraction used
>>>>>>>>>>>> in virtio-mdev and provides a generic vhost interface for
>>>>>>>>>>>> userspace to accelerate the virtio devices in guest.
>>>>>>>>>>>>
>>>>>>>>>>>> This backend is implemented as a mdev device driver on top
>>>>>>>>>>>> of the same mdev device ops used in virtio-mdev but using
>>>>>>>>>>>> a different mdev class id, and it will register the device
>>>>>>>>>>>> as a VFIO device for userspace to use. Userspace can setup
>>>>>>>>>>>> the IOMMU with the existing VFIO container/group APIs and
>>>>>>>>>>>> then get the device fd with the device name. After getting
>>>>>>>>>>>> the device fd of this device, userspace can use vhost ioctls
>>>>>>>>>>>> to setup the backend.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> This patch depends on below series:
>>>>>>>>>>>> https://lkml.org/lkml/2019/10/17/286
>>>>>>>>>>>>
>>>>>>>>>>>> v1 -> v2:
>>>>>>>>>>>> - Replace _SET_STATE with _SET_STATUS (MST);
>>>>>>>>>>>> - Check status bits at each step (MST);
>>>>>>>>>>>> - Report the max ring size and max number of queues (MST);
>>>>>>>>>>>> - Add missing MODULE_DEVICE_TABLE (Jason);
>>>>>>>>>>>> - Only support the network backend w/o multiqueue for now;
>>>>>>>>>>> Any idea on how to extend it to support
>>>>>>>>>>> devices other than net? I think we
>>>>>>>>>>> want a generic API or an API that could
>>>>>>>>>>> be made generic in the future.
>>>>>>>>>>>
>>>>>>>>>>> Do we want to e.g having a generic vhost
>>>>>>>>>>> mdev for all kinds of devices or
>>>>>>>>>>> introducing e.g vhost-net-mdev and vhost-scsi-mdev?
>>>>>>>>>> One possible way is to do what vhost-user does. I.e. Apart from
>>>>>>>>>> the generic ring, features, ... related ioctls, we also introduce
>>>>>>>>>> device specific ioctls when we need them. As vhost-mdev just needs
>>>>>>>>>> to forward configs between parent and userspace and even won't
>>>>>>>>>> cache any info when possible,
>>>>>>>>> So it looks to me this is only possible if we
>>>>>>>>> expose e.g set_config and
>>>>>>>>> get_config to userspace.
>>>>>>>> The set_config and get_config interface isn't really everything
>>>>>>>> of device specific settings. We also have ctrlq in virtio-net.
>>>>>>> Yes, but it could be processed by the exist API. Isn't
>>>>>>> it? Just set ctrl vq
>>>>>>> address and let parent to deal with that.
>>>>>> I mean how to expose ctrlq related settings to userspace?
>>>>> I think it works like:
>>>>>
>>>>> 1) userspace find ctrl_vq is supported
>>>>>
>>>>> 2) then it can allocate memory for ctrl vq and set its address through
>>>>> vhost-mdev
>>>>>
>>>>> 3) userspace can populate ctrl vq itself
>>>> I see. That is to say, userspace e.g. QEMU will program the
>>>> ctrl vq with the existing VHOST_*_VRING_* ioctls, and parent
>>>> drivers should know that the addresses used in ctrl vq are
>>>> host virtual addresses in vhost-mdev's case.
>>>
>>> That's really good point. And that means parent needs to differ vhost
>>> from virtio. It should work.
>>
>> HVA may only work when we have something similar to VHOST_SET_OWNER which
>> can reuse MM of its owner.
> We already have VHOST_SET_OWNER in vhost now, parent can handle
> the commands in its .kick_vq() which is called by vq's .handle_kick
> callback. Virtio-user did something similar:
>
> https://github.com/DPDK/dpdk/blob/0da7f445df445630c794897347ee360d6fe6348b/drivers/net/virtio/virtio_user_ethdev.c#L313-L322


This probably means a process context is required, something like the
kthread used by vhost, which seems a burden for the parent. Or we can
extend the ioctl path to process the kick in the system call context.
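
One possible shape of the "kick in the system call context" option (a
hypothetical sketch, not in this patch; the ioctl number 0x75 is made
up): userspace issues an ioctl instead of writing a kickfd for the
control vq, so ops->kick_vq() runs directly in the caller's process
context, and a parent that parses the ctrl vq in software can use
copy_from_user() on the owner's HVAs without any worker thread.

/* Hypothetical ioctl, not part of this series. */
#define VHOST_MDEV_KICK_VRING	_IOW(VHOST_VIRTIO, 0x75, __u32)

static long vhost_mdev_kick_vring(struct vhost_mdev *m, u32 __user *argp)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	u32 idx;

	if (get_user(idx, argp))
		return -EFAULT;
	if (idx >= m->nvqs)
		return -ENOBUFS;

	/* No eventfd, no vhost worker: kick_vq() runs in the caller's mm. */
	ops->kick_vq(m->mdev, idx);
	return 0;
}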


>
>>
>>> But is there any chance to use DMA address? I'm asking since the API
>>> then tends to be device specific.
>>
>> I wonder whether we can introduce MAP IOMMU notifier and get DMA mappings
>> from that.
> I think this will complicate things unnecessarily and may
> bring pains. Because, in vhost-mdev, mdev's ctrl vq is
> supposed to be managed by host.


Yes.


>   And we should try to avoid
> putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
> guests having the chance to bypass the host (e.g. QEMU) to
> setup the backend accelerator directly.


That's a really good point. So when the "vhost" type is created, the
parent should assume the addr of ctrl_vq is an HVA.

Thanks


>
>> Thanks
>>
Jason Wang Oct. 25, 2019, 9:54 a.m. UTC | #13
On 2019/10/24 下午6:42, Jason Wang wrote:
>
> Yes.
>
>
>>   And we should try to avoid
>> putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
>> guests having the chance to bypass the host (e.g. QEMU) to
>> setup the backend accelerator directly.
>
>
> That's really good point.  So when "vhost" type is created, parent 
> should assume addr of ctrl_vq is hva.
>
> Thanks


This works for vhost but not virtio, since there's no way for the virtio
kernel driver to distinguish the ctrl_vq from the rest when doing DMA
maps. One possible solution is to provide DMA domain isolation between
virtqueues; then the ctrl vq can use its own dedicated DMA domain.

Anyway, this could be done in the future. We can have a first version
that doesn't support ctrl_vq.

Thanks
Michael S. Tsirkin Oct. 25, 2019, 12:16 p.m. UTC | #14
On Fri, Oct 25, 2019 at 05:54:55PM +0800, Jason Wang wrote:
> 
> On 2019/10/24 下午6:42, Jason Wang wrote:
> > 
> > Yes.
> > 
> > 
> > >   And we should try to avoid
> > > putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
> > > guests having the chance to bypass the host (e.g. QEMU) to
> > > setup the backend accelerator directly.
> > 
> > 
> > That's really good point.  So when "vhost" type is created, parent
> > should assume addr of ctrl_vq is hva.
> > 
> > Thanks
> 
> 
> This works for vhost but not virtio since there's no way for virtio kernel
> driver to differ ctrl_vq with the rest when doing DMA map. One possible
> solution is to provide DMA domain isolation between virtqueues. Then ctrl vq
> can use its dedicated DMA domain for the work.
> 
> Anyway, this could be done in the future. We can have a version first that
> doesn't support ctrl_vq.
> 
> Thanks

Well no ctrl_vq implies either no offloads, or no XDP (since XDP needs
to disable offloads dynamically).

        if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
            && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
                virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
                virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
                virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO) ||
                virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))) {
                NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is implementing LRO/CSUM, disable LRO/CSUM first");
                return -EOPNOTSUPP;
        }

neither is very attractive.

So yes ok just for development but we do need to figure out how it will
work down the road in production.

So really this specific virtio net device does not support a control vq;
instead it supports a different, transport-specific way to send commands
to the device.

Some kind of extension to the transport? Ideas?
Tiwei Bie Oct. 28, 2019, 1:58 a.m. UTC | #15
On Fri, Oct 25, 2019 at 08:16:26AM -0400, Michael S. Tsirkin wrote:
> On Fri, Oct 25, 2019 at 05:54:55PM +0800, Jason Wang wrote:
> > On 2019/10/24 下午6:42, Jason Wang wrote:
> > > 
> > > Yes.
> > > 
> > > 
> > > >   And we should try to avoid
> > > > putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
> > > > guests having the chance to bypass the host (e.g. QEMU) to
> > > > setup the backend accelerator directly.
> > > 
> > > 
> > > That's really good point.  So when "vhost" type is created, parent
> > > should assume addr of ctrl_vq is hva.
> > > 
> > > Thanks
> > 
> > 
> > This works for vhost but not virtio since there's no way for virtio kernel
> > driver to differ ctrl_vq with the rest when doing DMA map. One possible
> > solution is to provide DMA domain isolation between virtqueues. Then ctrl vq
> > can use its dedicated DMA domain for the work.

It might not be a bad idea to let the parent drivers distinguish
between virtio-mdev mdevs and vhost-mdev mdevs in ctrl-vq handling
by mdev's class id.
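
A rough parent-driver sketch of that idea (hypothetical; only
mdev_set_virtio_ops()/mdev_set_vhost_ops() come from the series, the
my_parent_* names are made up): the parent already decides in its
create() callback which class it registers, so it can simply record
that choice and pick HVA vs. IOVA handling for the ctrl vq later,
without even needing a class-id getter.

#include <linux/mdev.h>
#include <linux/slab.h>
#include <linux/virtio_mdev.h>

struct my_parent_state {
	struct mdev_device *mdev;
	bool is_vhost;	/* true: ctrl vq addresses are HVAs, false: IOVAs */
};

static const struct virtio_mdev_device_ops my_parent_mdev_ops;	/* vendor ops, elided */

static struct my_parent_state *my_parent_alloc_state(struct mdev_device *mdev)
{
	struct my_parent_state *s = kzalloc(sizeof(*s), GFP_KERNEL);

	if (s)
		s->mdev = mdev;	/* attaching it to the mdev is elided */
	return s;
}

static int my_parent_create_vhost(struct kobject *kobj, struct mdev_device *mdev)
{
	struct my_parent_state *s = my_parent_alloc_state(mdev);

	if (!s)
		return -ENOMEM;
	s->is_vhost = true;	/* ctrl vq will be parsed via copy_from_user() */
	mdev_set_vhost_ops(mdev, &my_parent_mdev_ops);
	return 0;
}

static int my_parent_create_virtio(struct kobject *kobj, struct mdev_device *mdev)
{
	struct my_parent_state *s = my_parent_alloc_state(mdev);

	if (!s)
		return -ENOMEM;
	s->is_vhost = false;	/* ctrl vq is DMA mapped like any other vq */
	mdev_set_virtio_ops(mdev, &my_parent_mdev_ops);
	return 0;
}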

> > 
> > Anyway, this could be done in the future. We can have a version first that
> > doesn't support ctrl_vq.

+1, thanks

> > 
> > Thanks
> 
> Well no ctrl_vq implies either no offloads, or no XDP (since XDP needs
> to disable offloads dynamically).
> 
>         if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
>             && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
>                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
>                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO) ||
>                 virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))) {
>                 NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is implementing LRO/CSUM, disable LRO/CSUM first");
>                 return -EOPNOTSUPP;
>         }
> 
> neither is very attractive.
> 
> So yes ok just for development but we do need to figure out how it will
> work down the road in production.

Totally agree.

> 
> So really this specific virtio net device does not support control vq,
> instead it supports a different transport specific way to send commands
> to device.
> 
> Some kind of extension to the transport? Ideas?
> 
> 
> -- 
> MST
Jason Wang Oct. 28, 2019, 3:50 a.m. UTC | #16
On 2019/10/28 上午9:58, Tiwei Bie wrote:
> On Fri, Oct 25, 2019 at 08:16:26AM -0400, Michael S. Tsirkin wrote:
>> On Fri, Oct 25, 2019 at 05:54:55PM +0800, Jason Wang wrote:
>>> On 2019/10/24 下午6:42, Jason Wang wrote:
>>>> Yes.
>>>>
>>>>
>>>>>    And we should try to avoid
>>>>> putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
>>>>> guests having the chance to bypass the host (e.g. QEMU) to
>>>>> setup the backend accelerator directly.
>>>>
>>>> That's really good point.  So when "vhost" type is created, parent
>>>> should assume addr of ctrl_vq is hva.
>>>>
>>>> Thanks
>>>
>>> This works for vhost but not virtio since there's no way for virtio kernel
>>> driver to differ ctrl_vq with the rest when doing DMA map. One possible
>>> solution is to provide DMA domain isolation between virtqueues. Then ctrl vq
>>> can use its dedicated DMA domain for the work.
> It might not be a bad idea to let the parent drivers distinguish
> between virtio-mdev mdevs and vhost-mdev mdevs in ctrl-vq handling
> by mdev's class id.


Yes, that should work. I have something probably better, see below.


>
>>> Anyway, this could be done in the future. We can have a version first that
>>> doesn't support ctrl_vq.
> +1, thanks
>
>>> Thanks
>> Well no ctrl_vq implies either no offloads, or no XDP (since XDP needs
>> to disable offloads dynamically).
>>
>>          if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
>>              && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO) ||
>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))) {
>>                  NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is implementing LRO/CSUM, disable LRO/CSUM first");
>>                  return -EOPNOTSUPP;
>>          }
>>
>> neither is very attractive.
>>
>> So yes ok just for development but we do need to figure out how it will
>> work down the road in production.
> Totally agree.
>
>> So really this specific virtio net device does not support control vq,
>> instead it supports a different transport specific way to send commands
>> to device.
>>
>> Some kind of extension to the transport? Ideas?


So it's basically an issue of isolating DMA domains. Maybe we can start 
with a transport API for querying the per-vq DMA domain/ASID?

- for vhost-mdev, userspace can query the DMA domain for each specific 
virtqueue. For the control vq, mdev can return the id of a software 
domain; for the rest, mdev will return the id of the VFIO domain. Then 
userspace knows that it should use a different API for preparing each 
virtqueue, e.g. for vqs other than the control vq it should use the 
VFIO DMA API, while for the control vq it should use HVA instead.

- for virtio-mdev, we can introduce a per-vq DMA device, and route DMA 
mapping requests for the control vq back to mdev instead of the 
hardware. (We can wrap them into a library or helpers to ease the 
development of vendor physical drivers.)
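
For illustration only, the per-vq domain query could look roughly like
the sketch below. Nothing here exists in the posted series: the
get_vq_domain op, the domain IDs and the ioctl number are all made up,
and the handler assumes that op has been added to struct
virtio_mdev_device_ops.

/* Hypothetical domain IDs reported per virtqueue. */
#define VIRTIO_MDEV_DOMAIN_VFIO	0	/* addresses are IOVAs in the VFIO container */
#define VIRTIO_MDEV_DOMAIN_HVA	1	/* addresses are HVAs of the vhost owner */

/* Hypothetical extra op in struct virtio_mdev_device_ops:
 *	int (*get_vq_domain)(struct mdev_device *mdev, u16 idx);
 */

/* Hypothetical ioctl: pass a vq index in, get its domain ID back. */
#define VHOST_MDEV_GET_VRING_DOMAIN	_IOWR(VHOST_VIRTIO, 0x76, __u32)

static long vhost_mdev_get_vring_domain(struct vhost_mdev *m, u32 __user *argp)
{
	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
	u32 idx;
	int domain;

	if (get_user(idx, argp))
		return -EFAULT;
	if (idx >= m->nvqs)
		return -ENOBUFS;

	/* Assumes the op above has been added to virtio_mdev_device_ops. */
	domain = ops->get_vq_domain(m->mdev, idx);
	if (domain < 0)
		return domain;

	return put_user((u32)domain, argp);
}

Userspace would then query the domain of each vq before programming it:
the VFIO DMA API for VIRTIO_MDEV_DOMAIN_VFIO queues, plain process
addresses for VIRTIO_MDEV_DOMAIN_HVA ones (typically the ctrl vq).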

Thanks


>>
>>
>> -- 
>> MST
Tiwei Bie Oct. 29, 2019, 9:57 a.m. UTC | #17
On Mon, Oct 28, 2019 at 11:50:49AM +0800, Jason Wang wrote:
> On 2019/10/28 上午9:58, Tiwei Bie wrote:
> > On Fri, Oct 25, 2019 at 08:16:26AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Oct 25, 2019 at 05:54:55PM +0800, Jason Wang wrote:
> > > > On 2019/10/24 下午6:42, Jason Wang wrote:
> > > > > Yes.
> > > > > 
> > > > > 
> > > > > >    And we should try to avoid
> > > > > > putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
> > > > > > guests having the chance to bypass the host (e.g. QEMU) to
> > > > > > setup the backend accelerator directly.
> > > > > 
> > > > > That's really good point.  So when "vhost" type is created, parent
> > > > > should assume addr of ctrl_vq is hva.
> > > > > 
> > > > > Thanks
> > > > 
> > > > This works for vhost but not virtio since there's no way for virtio kernel
> > > > driver to differ ctrl_vq with the rest when doing DMA map. One possible
> > > > solution is to provide DMA domain isolation between virtqueues. Then ctrl vq
> > > > can use its dedicated DMA domain for the work.
> > It might not be a bad idea to let the parent drivers distinguish
> > between virtio-mdev mdevs and vhost-mdev mdevs in ctrl-vq handling
> > by mdev's class id.
> 
> 
> Yes, that should work, I have something probable better, see below.
> 
> 
> > 
> > > > Anyway, this could be done in the future. We can have a version first that
> > > > doesn't support ctrl_vq.
> > +1, thanks
> > 
> > > > Thanks
> > > Well no ctrl_vq implies either no offloads, or no XDP (since XDP needs
> > > to disable offloads dynamically).
> > > 
> > >          if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
> > >              && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> > >                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
> > >                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
> > >                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO) ||
> > >                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))) {
> > >                  NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is implementing LRO/CSUM, disable LRO/CSUM first");
> > >                  return -EOPNOTSUPP;
> > >          }
> > > 
> > > neither is very attractive.
> > > 
> > > So yes ok just for development but we do need to figure out how it will
> > > work down the road in production.
> > Totally agree.
> > 
> > > So really this specific virtio net device does not support control vq,
> > > instead it supports a different transport specific way to send commands
> > > to device.
> > > 
> > > Some kind of extension to the transport? Ideas?
> 
> 
> So it's basically an issue of isolating DMA domains. Maybe we can start with
> transport API for querying per vq DMA domain/ASID?
> 
> - for vhost-mdev, userspace can query the DMA domain for each specific
> virtqueue. For control vq, mdev can return id for software domain, for the
> rest mdev will return id of VFIO domain. Then userspace know that it should
> use different API for preparing the virtqueue, e.g for vq other than control
> vq, it should use VFIO DMA API. The control vq it should use hva instead.
> 
> - for virito-mdev, we can introduce per-vq DMA device, and route DMA mapping
> request for control vq back to mdev instead of the hardware. (We can wrap
> them into library or helpers to ease the development of vendor physical
> drivers).

Thanks for this proposal! I've been thinking about it these days.
I think it might be too complicated. I'm wondering whether we
can have something simpler. I will post an RFC patch to show
my idea today.

Thanks,
Tiwei

> 
> Thanks
> 
> 
> > > 
> > > 
> > > -- 
> > > MST
>
Jason Wang Oct. 29, 2019, 10:48 a.m. UTC | #18
On 2019/10/29 下午5:57, Tiwei Bie wrote:
> On Mon, Oct 28, 2019 at 11:50:49AM +0800, Jason Wang wrote:
>> On 2019/10/28 上午9:58, Tiwei Bie wrote:
>>> On Fri, Oct 25, 2019 at 08:16:26AM -0400, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 25, 2019 at 05:54:55PM +0800, Jason Wang wrote:
>>>>> On 2019/10/24 下午6:42, Jason Wang wrote:
>>>>>> Yes.
>>>>>>
>>>>>>
>>>>>>>    And we should try to avoid
>>>>>>> putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
>>>>>>> guests having the chance to bypass the host (e.g. QEMU) to
>>>>>>> setup the backend accelerator directly.
>>>>>> That's really good point.  So when "vhost" type is created, parent
>>>>>> should assume addr of ctrl_vq is hva.
>>>>>>
>>>>>> Thanks
>>>>> This works for vhost but not virtio since there's no way for virtio kernel
>>>>> driver to differ ctrl_vq with the rest when doing DMA map. One possible
>>>>> solution is to provide DMA domain isolation between virtqueues. Then ctrl vq
>>>>> can use its dedicated DMA domain for the work.
>>> It might not be a bad idea to let the parent drivers distinguish
>>> between virtio-mdev mdevs and vhost-mdev mdevs in ctrl-vq handling
>>> by mdev's class id.
>> Yes, that should work, I have something probable better, see below.
>>
>>
>>>>> Anyway, this could be done in the future. We can have a version first that
>>>>> doesn't support ctrl_vq.
>>> +1, thanks
>>>
>>>>> Thanks
>>>> Well no ctrl_vq implies either no offloads, or no XDP (since XDP needs
>>>> to disable offloads dynamically).
>>>>
>>>>          if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
>>>>              && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
>>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
>>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO) ||
>>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))) {
>>>>                  NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is implementing LRO/CSUM, disable LRO/CSUM first");
>>>>                  return -EOPNOTSUPP;
>>>>          }
>>>>
>>>> neither is very attractive.
>>>>
>>>> So yes ok just for development but we do need to figure out how it will
>>>> work down the road in production.
>>> Totally agree.
>>>
>>>> So really this specific virtio net device does not support control vq,
>>>> instead it supports a different transport specific way to send commands
>>>> to device.
>>>>
>>>> Some kind of extension to the transport? Ideas?
>> So it's basically an issue of isolating DMA domains. Maybe we can start with
>> transport API for querying per vq DMA domain/ASID?
>>
>> - for vhost-mdev, userspace can query the DMA domain for each specific
>> virtqueue. For control vq, mdev can return id for software domain, for the
>> rest mdev will return id of VFIO domain. Then userspace know that it should
>> use different API for preparing the virtqueue, e.g for vq other than control
>> vq, it should use VFIO DMA API. The control vq it should use hva instead.
>>
>> - for virito-mdev, we can introduce per-vq DMA device, and route DMA mapping
>> request for control vq back to mdev instead of the hardware. (We can wrap
>> them into library or helpers to ease the development of vendor physical
>> drivers).
> Thanks for this proposal! I'm thinking about it these days.
> I think it might be too complicated. I'm wondering whether we
> can have something simpler. I will post a RFC patch to show
> my idea today.


Thanks, will check.

Btw, for virtio-mdev, the change should be very minimal; will post an
RFC as well. For vhost-mdev, it could be just a helper to return an ID
for the DMA domain, like ID_VFIO or ID_HVA.

Or a more straightforward way is to force queues like the control vq to use PA.


>
> Thanks,
> Tiwei
>
Tiwei Bie Oct. 30, 2019, 1:27 a.m. UTC | #19
On Tue, Oct 29, 2019 at 06:48:27PM +0800, Jason Wang wrote:
> On 2019/10/29 下午5:57, Tiwei Bie wrote:
> > On Mon, Oct 28, 2019 at 11:50:49AM +0800, Jason Wang wrote:
> >> On 2019/10/28 上午9:58, Tiwei Bie wrote:
> >>> On Fri, Oct 25, 2019 at 08:16:26AM -0400, Michael S. Tsirkin wrote:
> >>>> On Fri, Oct 25, 2019 at 05:54:55PM +0800, Jason Wang wrote:
> >>>>> On 2019/10/24 下午6:42, Jason Wang wrote:
> >>>>>> Yes.
> >>>>>>
> >>>>>>
> >>>>>>>    And we should try to avoid
> >>>>>>> putting ctrl vq and Rx/Tx vqs in the same DMA space to prevent
> >>>>>>> guests having the chance to bypass the host (e.g. QEMU) to
> >>>>>>> setup the backend accelerator directly.
> >>>>>> That's really good point.  So when "vhost" type is created, parent
> >>>>>> should assume addr of ctrl_vq is hva.
> >>>>>>
> >>>>>> Thanks
> >>>>> This works for vhost but not virtio since there's no way for virtio kernel
> >>>>> driver to differ ctrl_vq with the rest when doing DMA map. One possible
> >>>>> solution is to provide DMA domain isolation between virtqueues. Then ctrl vq
> >>>>> can use its dedicated DMA domain for the work.
> >>> It might not be a bad idea to let the parent drivers distinguish
> >>> between virtio-mdev mdevs and vhost-mdev mdevs in ctrl-vq handling
> >>> by mdev's class id.
> >> Yes, that should work, I have something probable better, see below.
> >>
> >>
> >>>>> Anyway, this could be done in the future. We can have a version first that
> >>>>> doesn't support ctrl_vq.
> >>> +1, thanks
> >>>
> >>>>> Thanks
> >>>> Well no ctrl_vq implies either no offloads, or no XDP (since XDP needs
> >>>> to disable offloads dynamically).
> >>>>
> >>>>          if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
> >>>>              && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> >>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
> >>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
> >>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO) ||
> >>>>                  virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))) {
> >>>>                  NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is implementing LRO/CSUM, disable LRO/CSUM first");
> >>>>                  return -EOPNOTSUPP;
> >>>>          }
> >>>>
> >>>> neither is very attractive.
> >>>>
> >>>> So yes ok just for development but we do need to figure out how it will
> >>>> work down the road in production.
> >>> Totally agree.
> >>>
> >>>> So really this specific virtio net device does not support control vq,
> >>>> instead it supports a different transport specific way to send commands
> >>>> to device.
> >>>>
> >>>> Some kind of extension to the transport? Ideas?
> >> So it's basically an issue of isolating DMA domains. Maybe we can start with
> >> transport API for querying per vq DMA domain/ASID?
> >>
> >> - for vhost-mdev, userspace can query the DMA domain for each specific
> >> virtqueue. For control vq, mdev can return id for software domain, for the
> >> rest mdev will return id of VFIO domain. Then userspace know that it should
> >> use different API for preparing the virtqueue, e.g for vq other than control
> >> vq, it should use VFIO DMA API. The control vq it should use hva instead.
> >>
> >> - for virito-mdev, we can introduce per-vq DMA device, and route DMA mapping
> >> request for control vq back to mdev instead of the hardware. (We can wrap
> >> them into library or helpers to ease the development of vendor physical
> >> drivers).
> > Thanks for this proposal! I'm thinking about it these days.
> > I think it might be too complicated. I'm wondering whether we
> > can have something simpler. I will post a RFC patch to show
> > my idea today.
> 
> 
> Thanks, will check.
> 
> Btw, for virtio-mdev, the change should be very minimal, will post an
> RFC as well. For vhost-mdev, it could be just a helper to return an ID
> for DMA domain like ID_VFIO or ID_HVA.
> 
> Or a more straightforward way is to force queues like control vq to use PA.

Will check. Thanks!

> 
> 
> >
> > Thanks,
> > Tiwei
> >
>
diff mbox series

Patch

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 5834f6b7c7a5..2963f65e6648 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -69,6 +69,18 @@  void mdev_set_virtio_ops(struct mdev_device *mdev,
 }
 EXPORT_SYMBOL(mdev_set_virtio_ops);
 
+/* Specify the vhost device ops for the mdev device, this
+ * must be called during create() callback for vhost mdev device.
+ */
+void mdev_set_vhost_ops(struct mdev_device *mdev,
+			const struct virtio_mdev_device_ops *vhost_ops)
+{
+	WARN_ON(mdev->class_id);
+	mdev->class_id = MDEV_CLASS_ID_VHOST;
+	mdev->device_ops = vhost_ops;
+}
+EXPORT_SYMBOL(mdev_set_vhost_ops);
+
 const void *mdev_get_dev_ops(struct mdev_device *mdev)
 {
 	return mdev->device_ops;
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..7b5c2f655af7 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@  config VHOST_VSOCK
 	To compile this driver as a module, choose M here: the module will be called
 	vhost_vsock.
 
+config VHOST_MDEV
+	tristate "Vhost driver for Mediated devices"
+	depends on EVENTFD && VFIO && VFIO_MDEV
+	select VHOST
+	default n
+	---help---
+	Say M here to enable the vhost_mdev module for use with
+	the mediated device based hardware vhost accelerators.
+
 config VHOST
 	tristate
 	---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@  vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index 000000000000..5f9cae61018c
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,415 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include <linux/compat.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/mdev.h>
+#include <linux/module.h>
+#include <linux/vfio.h>
+#include <linux/vhost.h>
+#include <linux/virtio_mdev.h>
+#include <linux/virtio_ids.h>
+
+#include "vhost.h"
+
+/* Currently, only network backend w/o multiqueue is supported. */
+#define VHOST_MDEV_VQ_MAX	2
+
+struct vhost_mdev {
+	/* The lock is to protect this structure. */
+	struct mutex mutex;
+	struct vhost_dev dev;
+	struct vhost_virtqueue *vqs;
+	int nvqs;
+	u64 status;
+	u64 features;
+	u64 acked_features;
+	bool opened;
+	struct mdev_device *mdev;
+};
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_mdev *m = container_of(vq->dev, struct vhost_mdev, dev);
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+
+	ops->kick_vq(m->mdev, vq - m->vqs);
+}
+
+static irqreturn_t vhost_mdev_virtqueue_cb(void *private)
+{
+	struct vhost_virtqueue *vq = private;
+	struct eventfd_ctx *call_ctx = vq->call_ctx;
+
+	if (call_ctx)
+		eventfd_signal(call_ctx, 1);
+	return IRQ_HANDLED;
+}
+
+static void vhost_mdev_reset(struct vhost_mdev *m)
+{
+	struct mdev_device *mdev = m->mdev;
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
+
+	m->status = 0;
+	return ops->set_status(mdev, m->status);
+}
+
+static long vhost_mdev_get_status(struct vhost_mdev *m, u8 __user *statusp)
+{
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+	struct mdev_device *mdev = m->mdev;
+	u8 status;
+
+	status = ops->get_status(mdev);
+	m->status = status;
+
+	if (copy_to_user(statusp, &status, sizeof(status)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long vhost_mdev_set_status(struct vhost_mdev *m, u8 __user *statusp)
+{
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+	struct mdev_device *mdev = m->mdev;
+	u8 status;
+
+	if (copy_from_user(&status, statusp, sizeof(status)))
+		return -EFAULT;
+
+	/*
+	 * Userspace shouldn't remove status bits unless reset the
+	 * status to 0.
+	 */
+	if (status != 0 && (m->status & ~status) != 0)
+		return -EINVAL;
+
+	ops->set_status(mdev, status);
+	m->status = ops->get_status(mdev);
+
+	return 0;
+}
+
+static long vhost_mdev_get_features(struct vhost_mdev *m, u64 __user *featurep)
+{
+	if (copy_to_user(featurep, &m->features, sizeof(m->features)))
+		return -EFAULT;
+	return 0;
+}
+
+static long vhost_mdev_set_features(struct vhost_mdev *m, u64 __user *featurep)
+{
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+	struct mdev_device *mdev = m->mdev;
+	u64 features;
+
+	/*
+	 * It's not allowed to change the features after they have
+	 * been negotiated.
+	 */
+	if (m->status & VIRTIO_CONFIG_S_FEATURES_OK)
+		return -EPERM;
+
+	if (copy_from_user(&features, featurep, sizeof(features)))
+		return -EFAULT;
+
+	if (features & ~m->features)
+		return -EINVAL;
+
+	m->acked_features = features;
+	if (ops->set_features(mdev, m->acked_features))
+		return -ENODEV;
+
+	return 0;
+}
+
+static long vhost_mdev_get_vring_num(struct vhost_mdev *m, u16 __user *argp)
+{
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+	struct mdev_device *mdev = m->mdev;
+	u16 num;
+
+	num = ops->get_vq_num_max(mdev);
+
+	if (copy_to_user(argp, &num, sizeof(num)))
+		return -EFAULT;
+	return 0;
+}
+
+static long vhost_mdev_get_queue_num(struct vhost_mdev *m, u32 __user *argp)
+{
+	u32 nvqs = m->nvqs;
+
+	if (copy_to_user(argp, &nvqs, sizeof(nvqs)))
+		return -EFAULT;
+	return 0;
+}
+
+static long vhost_mdev_vring_ioctl(struct vhost_mdev *m, unsigned int cmd,
+				   void __user *argp)
+{
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(m->mdev);
+	struct mdev_device *mdev = m->mdev;
+	struct virtio_mdev_callback cb;
+	struct vhost_virtqueue *vq;
+	struct vhost_vring_state s;
+	u32 idx;
+	long r;
+
+	r = get_user(idx, (u32 __user *)argp);
+	if (r < 0)
+		return r;
+	if (idx >= m->nvqs)
+		return -ENOBUFS;
+
+	/*
+	 * It's not allowed to detect and program vqs before
+	 * features negotiation or after enabling driver.
+	 */
+	if (!(m->status & VIRTIO_CONFIG_S_FEATURES_OK) ||
+	    (m->status & VIRTIO_CONFIG_S_DRIVER_OK))
+		return -EPERM;
+
+	vq = &m->vqs[idx];
+
+	if (cmd == VHOST_MDEV_SET_VRING_ENABLE) {
+		if (copy_from_user(&s, argp, sizeof(s)))
+			return -EFAULT;
+		ops->set_vq_ready(mdev, idx, s.num);
+		return 0;
+	}
+
+	/*
+	 * It's not allowed to detect and program vqs with
+	 * vqs enabled.
+	 */
+	if (ops->get_vq_ready(mdev, idx))
+		return -EPERM;
+
+	if (cmd == VHOST_GET_VRING_BASE)
+		vq->last_avail_idx = ops->get_vq_state(m->mdev, idx);
+
+	r = vhost_vring_ioctl(&m->dev, cmd, argp);
+	if (r)
+		return r;
+
+	switch (cmd) {
+	case VHOST_SET_VRING_ADDR:
+		/*
+		 * In vhost-mdev, the ring addresses set by userspace should
+		 * be the DMA addresses within the VFIO container/group.
+		 */
+		if (ops->set_vq_address(mdev, idx, (u64)vq->desc,
+					(u64)vq->avail, (u64)vq->used))
+			r = -ENODEV;
+		break;
+
+	case VHOST_SET_VRING_BASE:
+		if (ops->set_vq_state(mdev, idx, vq->last_avail_idx))
+			r = -ENODEV;
+		break;
+
+	case VHOST_SET_VRING_CALL:
+		if (vq->call_ctx) {
+			cb.callback = vhost_mdev_virtqueue_cb;
+			cb.private = vq;
+		} else {
+			cb.callback = NULL;
+			cb.private = NULL;
+		}
+		ops->set_vq_cb(mdev, idx, &cb);
+		break;
+
+	case VHOST_SET_VRING_NUM:
+		ops->set_vq_num(mdev, idx, vq->num);
+		break;
+	}
+
+	return r;
+}
+
+static int vhost_mdev_open(void *device_data)
+{
+	struct vhost_mdev *m = device_data;
+	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs;
+	int nvqs, i, r;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&m->mutex);
+
+	if (m->opened) {
+		r = -EBUSY;
+		goto err;
+	}
+
+	nvqs = m->nvqs;
+	vhost_mdev_reset(m);
+
+	memset(&m->dev, 0, sizeof(m->dev));
+	memset(m->vqs, 0, nvqs * sizeof(struct vhost_virtqueue));
+
+	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		r = -ENOMEM;
+		goto err;
+	}
+
+	dev = &m->dev;
+	for (i = 0; i < nvqs; i++) {
+		vqs[i] = &m->vqs[i];
+		vqs[i]->handle_kick = handle_vq_kick;
+	}
+	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
+	m->opened = true;
+	mutex_unlock(&m->mutex);
+
+	return 0;
+
+err:
+	mutex_unlock(&m->mutex);
+	module_put(THIS_MODULE);
+	return r;
+}
+
+static void vhost_mdev_release(void *device_data)
+{
+	struct vhost_mdev *m = device_data;
+
+	mutex_lock(&m->mutex);
+	vhost_mdev_reset(m);
+	vhost_dev_stop(&m->dev);
+	vhost_dev_cleanup(&m->dev);
+
+	kfree(m->dev.vqs);
+	m->opened = false;
+	mutex_unlock(&m->mutex);
+	module_put(THIS_MODULE);
+}
+
+static long vhost_mdev_unlocked_ioctl(void *device_data,
+				      unsigned int cmd, unsigned long arg)
+{
+	struct vhost_mdev *m = device_data;
+	void __user *argp = (void __user *)arg;
+	long r;
+
+	mutex_lock(&m->mutex);
+
+	switch (cmd) {
+	case VHOST_MDEV_GET_STATUS:
+		r = vhost_mdev_get_status(m, argp);
+		break;
+	case VHOST_MDEV_SET_STATUS:
+		r = vhost_mdev_set_status(m, argp);
+		break;
+	case VHOST_GET_FEATURES:
+		r = vhost_mdev_get_features(m, argp);
+		break;
+	case VHOST_SET_FEATURES:
+		r = vhost_mdev_set_features(m, argp);
+		break;
+	case VHOST_MDEV_GET_VRING_NUM:
+		r = vhost_mdev_get_vring_num(m, argp);
+		break;
+	case VHOST_MDEV_GET_QUEUE_NUM:
+		r = vhost_mdev_get_queue_num(m, argp);
+		break;
+	default:
+		r = vhost_dev_ioctl(&m->dev, cmd, argp);
+		if (r == -ENOIOCTLCMD)
+			r = vhost_mdev_vring_ioctl(m, cmd, argp);
+	}
+
+	mutex_unlock(&m->mutex);
+	return r;
+}
+
+static const struct vfio_device_ops vfio_vhost_mdev_dev_ops = {
+	.name		= "vfio-vhost-mdev",
+	.open		= vhost_mdev_open,
+	.release	= vhost_mdev_release,
+	.ioctl		= vhost_mdev_unlocked_ioctl,
+};
+
+static int vhost_mdev_probe(struct device *dev)
+{
+	struct mdev_device *mdev = mdev_from_dev(dev);
+	const struct virtio_mdev_device_ops *ops = mdev_get_dev_ops(mdev);
+	struct vhost_mdev *m;
+	int nvqs, r;
+
+	/* Currently, only network backend is supported. */
+	if (ops->get_device_id(mdev) != VIRTIO_ID_NET)
+		return -ENOTSUPP;
+
+	if (ops->get_mdev_features(mdev) != VIRTIO_MDEV_F_VERSION_1)
+		return -ENOTSUPP;
+
+	m = devm_kzalloc(dev, sizeof(*m), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	if (!m)
+		return -ENOMEM;
+
+	nvqs = VHOST_MDEV_VQ_MAX;
+	m->nvqs = nvqs;
+
+	m->vqs = devm_kmalloc_array(dev, nvqs, sizeof(struct vhost_virtqueue),
+				    GFP_KERNEL);
+	if (!m->vqs)
+		return -ENOMEM;
+
+	r = vfio_add_group_dev(dev, &vfio_vhost_mdev_dev_ops, m);
+	if (r)
+		return r;
+
+	mutex_init(&m->mutex);
+	m->features = ops->get_features(mdev);
+	m->mdev = mdev;
+	return 0;
+}
+
+static void vhost_mdev_remove(struct device *dev)
+{
+	struct vhost_mdev *m;
+
+	m = vfio_del_group_dev(dev);
+	mutex_destroy(&m->mutex);
+}
+
+static const struct mdev_class_id vhost_mdev_match[] = {
+	{ MDEV_CLASS_ID_VHOST },
+	{ 0 },
+};
+MODULE_DEVICE_TABLE(mdev, vhost_mdev_match);
+
+static struct mdev_driver vhost_mdev_driver = {
+	.name	= "vhost_mdev",
+	.probe	= vhost_mdev_probe,
+	.remove	= vhost_mdev_remove,
+	.id_table = vhost_mdev_match,
+};
+
+static int __init vhost_mdev_init(void)
+{
+	return mdev_register_driver(&vhost_mdev_driver, THIS_MODULE);
+}
+module_init(vhost_mdev_init);
+
+static void __exit vhost_mdev_exit(void)
+{
+	mdev_unregister_driver(&vhost_mdev_driver);
+}
+module_exit(vhost_mdev_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Mediated device based accelerator for virtio");
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index 13e045e09d3b..6060cdbe6d3e 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -114,6 +114,8 @@  void mdev_set_vfio_ops(struct mdev_device *mdev,
 		       const struct vfio_mdev_device_ops *vfio_ops);
 void mdev_set_virtio_ops(struct mdev_device *mdev,
                          const struct virtio_mdev_device_ops *virtio_ops);
+void mdev_set_vhost_ops(struct mdev_device *mdev,
+			const struct virtio_mdev_device_ops *vhost_ops);
 const void *mdev_get_dev_ops(struct mdev_device *mdev);
 
 extern struct bus_type mdev_bus_type;
@@ -131,6 +133,7 @@  struct mdev_device *mdev_from_dev(struct device *dev);
 enum {
 	MDEV_CLASS_ID_VFIO = 1,
 	MDEV_CLASS_ID_VIRTIO = 2,
+	MDEV_CLASS_ID_VHOST = 3,
 	/* New entries must be added here */
 };
 
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 40d028eed645..dad3c62bd91b 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -116,4 +116,17 @@ 
 #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
 #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
 
+/* VHOST_MDEV specific defines */
+
+/* Get and set the status of the backend. The status bits follow the
+ * same definition of the device status defined in virtio-spec. */
+#define VHOST_MDEV_GET_STATUS		_IOW(VHOST_VIRTIO, 0x70, __u8)
+#define VHOST_MDEV_SET_STATUS		_IOW(VHOST_VIRTIO, 0x71, __u8)
+/* Enable/disable the ring. */
+#define VHOST_MDEV_SET_VRING_ENABLE	_IOW(VHOST_VIRTIO, 0x72, struct vhost_vring_state)
+/* Get the max ring size. */
+#define VHOST_MDEV_GET_VRING_NUM	_IOW(VHOST_VIRTIO, 0x73, __u16)
+/* Get the max number of queues. */
+#define VHOST_MDEV_GET_QUEUE_NUM	_IOW(VHOST_VIRTIO, 0x74, __u32)
+
 #endif
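
Not part of the patch, but for reference, a rough userspace sketch of
driving this interface end to end with the ioctls added above (it
assumes the patched linux/vhost.h). Error checking is dropped, the VFIO
group number, mdev device name, ring size and IOVA layout are
placeholders, and only a single vring is shown.

#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>
#include <linux/vhost.h>
#include <linux/virtio_config.h>

int main(void)
{
	/* 1. Standard VFIO setup: container, group, IOMMU, device fd. */
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/12", O_RDWR);	/* group number: example */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
	int dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD,
			"83b8f4f2-509f-382f-3c1e-e6bfe0fa1001");	/* mdev name: example */

	/* 2. Map the memory that will hold rings and buffers; the ring
	 *    addresses passed below are IOVAs inside this mapping. */
	size_t ring_mem_size = 2 * 1024 * 1024;
	void *ring_mem = mmap(NULL, ring_mem_size, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)ring_mem,
		.iova  = 0x100000,
		.size  = ring_mem_size,
	};
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* 3. Ownership, status and feature negotiation. */
	ioctl(dev, VHOST_SET_OWNER);
	uint8_t status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
	ioctl(dev, VHOST_MDEV_SET_STATUS, &status);

	uint64_t features;
	ioctl(dev, VHOST_GET_FEATURES, &features);
	/* mask 'features' down to what this application supports, then ack */
	ioctl(dev, VHOST_SET_FEATURES, &features);
	status |= VIRTIO_CONFIG_S_FEATURES_OK;
	ioctl(dev, VHOST_MDEV_SET_STATUS, &status);

	/* 4. Program and enable vring 0 (allowed only after FEATURES_OK). */
	uint16_t max_ring;
	uint32_t nvqs;
	ioctl(dev, VHOST_MDEV_GET_VRING_NUM, &max_ring);
	ioctl(dev, VHOST_MDEV_GET_QUEUE_NUM, &nvqs);

	struct vhost_vring_state num  = { .index = 0,
					  .num = max_ring < 256 ? max_ring : 256 };
	struct vhost_vring_addr  addr = {
		.index = 0,
		.desc_user_addr  = 0x100000,	/* example IOVAs in the map above */
		.avail_user_addr = 0x101000,
		.used_user_addr  = 0x102000,
	};
	struct vhost_vring_state base = { .index = 0, .num = 0 };
	struct vhost_vring_file  kick = { .index = 0, .fd = eventfd(0, 0) };
	struct vhost_vring_file  call = { .index = 0, .fd = eventfd(0, 0) };
	struct vhost_vring_state on   = { .index = 0, .num = 1 };

	ioctl(dev, VHOST_SET_VRING_NUM, &num);
	ioctl(dev, VHOST_SET_VRING_ADDR, &addr);
	ioctl(dev, VHOST_SET_VRING_BASE, &base);
	ioctl(dev, VHOST_SET_VRING_KICK, &kick);
	ioctl(dev, VHOST_SET_VRING_CALL, &call);
	ioctl(dev, VHOST_MDEV_SET_VRING_ENABLE, &on);

	/* 5. Let the device start processing the ring. */
	status |= VIRTIO_CONFIG_S_DRIVER_OK;
	ioctl(dev, VHOST_MDEV_SET_STATUS, &status);
	return 0;
}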