[V3,11/11] vhost: allow userspace to create workers

Message ID	20211022051911.108383-13-michael.christie@oracle.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <target-devel-owner@kernel.org> From: Mike Christie <michael.christie@oracle.com> To: target-devel@vger.kernel.org, linux-scsi@vger.kernel.org, stefanha@redhat.com, pbonzini@redhat.com, jasowang@redhat.com, mst@redhat.com, sgarzare@redhat.com, virtualization@lists.linux-foundation.org Cc: Mike Christie <michael.christie@oracle.com> Subject: [PATCH V3 11/11] vhost: allow userspace to create workers Date: Fri, 22 Oct 2021 00:19:11 -0500 Message-Id: <20211022051911.108383-13-michael.christie@oracle.com> In-Reply-To: <20211022051911.108383-1-michael.christie@oracle.com> References: <20211022051911.108383-1-michael.christie@oracle.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 Precedence: bulk
Series	vhost: multiple worker support \| expand [V3,00/11] vhost: multiple worker support QEMU vhost-scsi: add support for VHOST_SET_VRING_WORKER [V3,02/11] vhost, vhost-net: add helper to check if vq has work [V3,03/11] vhost: take worker or vq instead of dev for queueing [V3,04/11] vhost: take worker or vq instead of dev for flushing [V3,05/11] vhost: convert poll work to be vq based [V3,06/11] vhost-sock: convert to vq helpers [V3,07/11] vhost-scsi: make SCSI cmd completion per vq [V3,08/11] vhost-scsi: convert to vq helpers [V3,09/11] vhost-scsi: flush IO vqs then send TMF rsp [V3,10/11] vhost: remove device wide queu/flushing helpers [V3,11/11] vhost: allow userspace to create workers

Mike Christie Oct. 22, 2021, 5:19 a.m. UTC

This patch allows userspace to create workers and bind them to vqs. You
can have N workers per dev and also share N workers with M vqs.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/vhost.c            | 99 ++++++++++++++++++++++++++++----
 drivers/vhost/vhost.h            |  2 +-
 include/uapi/linux/vhost.h       | 11 ++++
 include/uapi/linux/vhost_types.h | 12 ++++
 4 files changed, 112 insertions(+), 12 deletions(-)

Michael S. Tsirkin Oct. 22, 2021, 10:47 a.m. UTC | #1

On Fri, Oct 22, 2021 at 12:19:11AM -0500, Mike Christie wrote:
> This patch allows userspace to create workers and bind them to vqs. You
> can have N workers per dev and also share N workers with M vqs.
> 
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/vhost/vhost.c            | 99 ++++++++++++++++++++++++++++----
>  drivers/vhost/vhost.h            |  2 +-
>  include/uapi/linux/vhost.h       | 11 ++++
>  include/uapi/linux/vhost_types.h | 12 ++++
>  4 files changed, 112 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 04f43a6445e1..c86e88d7f35c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -493,7 +493,6 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->umem = NULL;
>  	dev->iotlb = NULL;
>  	dev->mm = NULL;
> -	dev->worker = NULL;
>  	dev->iov_limit = iov_limit;
>  	dev->weight = weight;
>  	dev->byte_weight = byte_weight;
> @@ -576,20 +575,40 @@ static void vhost_worker_stop(struct vhost_worker *worker)
>  	wait_for_completion(worker->exit_done);
>  }
>  
> -static void vhost_worker_free(struct vhost_dev *dev)
> -{
> -	struct vhost_worker *worker = dev->worker;
>  
> +static void vhost_worker_put(struct vhost_worker *worker)
> +{
>  	if (!worker)
>  		return;
>  
> -	dev->worker = NULL;
> +	if (!refcount_dec_and_test(&worker->refcount))
> +		return;
> +
>  	WARN_ON(!llist_empty(&worker->work_list));
>  	vhost_worker_stop(worker);
>  	kfree(worker);
>  }
>  
> -static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> +static void vhost_vq_clear_worker(struct vhost_virtqueue *vq)
> +{
> +	if (vq->worker)
> +		vhost_worker_put(vq->worker);
> +	vq->worker = NULL;
> +}
> +
> +static void vhost_workers_free(struct vhost_dev *dev)
> +{
> +	int i;
> +
> +	if (!dev->use_worker)
> +		return;
> +
> +	for (i = 0; i < dev->nvqs; i++)
> +		vhost_vq_clear_worker(dev->vqs[i]);
> +}
> +
> +static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev,
> +						int init_vq_map_count)
>  {
>  	struct vhost_worker *worker;
>  	struct task_struct *task;
> @@ -598,9 +617,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>  	if (!worker)
>  		return NULL;
>  
> -	dev->worker = worker;
>  	worker->kcov_handle = kcov_common_handle();
>  	init_llist_head(&worker->work_list);
> +	refcount_set(&worker->refcount, init_vq_map_count);
>  
>  	/*
>  	 * vhost used to use the kthread API which ignores all signals by
> @@ -617,10 +636,58 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>  
>  free_worker:
>  	kfree(worker);
> -	dev->worker = NULL;
>  	return NULL;
>  }
>  
> +static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
> +{
> +	struct vhost_worker *worker = NULL;
> +	int i;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		if (dev->vqs[i]->worker->task->pid != pid)
> +			continue;
> +
> +		worker = dev->vqs[i]->worker;
> +		break;
> +	}
> +
> +	return worker;
> +}
> +
> +/* Caller must have device mutex */
> +static int vhost_vq_setup_worker(struct vhost_virtqueue *vq,
> +				 struct vhost_vring_worker *info)
> +{
> +	struct vhost_dev *dev = vq->dev;
> +	struct vhost_worker *worker;
> +
> +	if (!dev->use_worker)
> +		return -EINVAL;
> +
> +	/* We don't support setting a worker on an active vq */
> +	if (vq->private_data)
> +		return -EBUSY;
> +
> +	if (info->pid == VHOST_VRING_NEW_WORKER) {
> +		worker = vhost_worker_create(dev, 1);
> +		if (!worker)
> +			return -ENOMEM;
> +
> +		info->pid = worker->task->pid;
> +	} else {
> +		worker = vhost_worker_find(dev, info->pid);
> +		if (!worker)
> +			return -ENODEV;
> +
> +		refcount_inc(&worker->refcount);
> +	}
> +
> +	vhost_vq_clear_worker(vq);
> +	vq->worker = worker;
> +	return 0;
> +}
> +
>  /* Caller should have device mutex */
>  long vhost_dev_set_owner(struct vhost_dev *dev)
>  {
> @@ -636,7 +703,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>  	vhost_attach_mm(dev);
>  
>  	if (dev->use_worker) {
> -		worker = vhost_worker_create(dev);
> +		worker = vhost_worker_create(dev, dev->nvqs);
>  		if (!worker)
>  			goto err_worker;
>  
> @@ -650,7 +717,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>  
>  	return 0;
>  err_iovecs:
> -	vhost_worker_free(dev);
> +	vhost_workers_free(dev);
>  err_worker:
>  	vhost_detach_mm(dev);
>  err_mm:
> @@ -742,7 +809,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  	dev->iotlb = NULL;
>  	vhost_clear_msg(dev);
>  	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
> -	vhost_worker_free(dev);
> +	vhost_workers_free(dev);
>  	vhost_detach_mm(dev);
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -1612,6 +1679,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  	struct eventfd_ctx *ctx = NULL;
>  	u32 __user *idxp = argp;
>  	struct vhost_virtqueue *vq;
> +	struct vhost_vring_worker w;
>  	struct vhost_vring_state s;
>  	struct vhost_vring_file f;
>  	u32 idx;
> @@ -1719,6 +1787,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>  		if (copy_to_user(argp, &s, sizeof(s)))
>  			r = -EFAULT;
>  		break;
> +	case VHOST_SET_VRING_WORKER:
> +		if (copy_from_user(&w, argp, sizeof(w))) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		r = vhost_vq_setup_worker(vq, &w);
> +		if (!r && copy_to_user(argp, &w, sizeof(w)))
> +			r = -EFAULT;
> +		break;
>  	default:
>  		r = -ENOIOCTLCMD;
>  	}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 33c63b24187a..0911d1a9bd3b 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -35,6 +35,7 @@ struct vhost_worker {
>  	struct llist_head	work_list;
>  	u64			kcov_handle;
>  	unsigned long		flags;
> +	refcount_t		refcount;
>  };
>  
>  /* Poll a file (eventfd or socket) */
> @@ -160,7 +161,6 @@ struct vhost_dev {
>  	struct vhost_virtqueue **vqs;
>  	int nvqs;
>  	struct eventfd_ctx *log_ctx;
> -	struct vhost_worker *worker;
>  	struct vhost_iotlb *umem;
>  	struct vhost_iotlb *iotlb;
>  	spinlock_t iotlb_lock;
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index c998860d7bbc..e5c0669430e5 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -70,6 +70,17 @@
>  #define VHOST_VRING_BIG_ENDIAN 1
>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> + * that its virtqueues share. This allows userspace to create a vhost_worker
> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> + *
> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> + * created and bound to the vq.
> + *
> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> + */

A couple of things here:
it's probably a good idea not to make it match pid exactly,
if for no other reason than I'm not sure we want to
commit this being a pid. Let's just call it an id?
And maybe byteswap it or xor with some value
just to make sure userspace does not begin abusing it anyway.

Also, interaction with pid namespace is unclear to me.
Can you document what happens here?
No need to fix funky things like moving the fd between
pid namespaces while also creating/destroying workers, but let's
document it's not supported.


> +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
>  
>  /* The following ioctls use eventfd file descriptors to signal and poll
>   * for events. */
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index f7f6a3a28977..af654e3cef0e 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -47,6 +47,18 @@ struct vhost_vring_addr {
>  	__u64 log_guest_addr;
>  };
>  
> +#define VHOST_VRING_NEW_WORKER -1
> +
> +struct vhost_vring_worker {
> +	unsigned int index;
> +	/*
> +	 * The pid of the vhost worker that the vq will be bound to. If
> +	 * pid is VHOST_VRING_NEW_WORKER a new worker will be created and its
> +	 * pid will be returned in pid.
> +	 */
> +	__kernel_pid_t pid;
> +};
> +
>  /* no alignment requirement */
>  struct vhost_iotlb_msg {
>  	__u64 iova;
> -- 
> 2.25.1

Mike Christie Oct. 22, 2021, 4:12 p.m. UTC | #2

On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>> index c998860d7bbc..e5c0669430e5 100644
>> --- a/include/uapi/linux/vhost.h
>> +++ b/include/uapi/linux/vhost.h
>> @@ -70,6 +70,17 @@
>>  #define VHOST_VRING_BIG_ENDIAN 1
>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
>> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
>> + * that its virtqueues share. This allows userspace to create a vhost_worker
>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
>> + *
>> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
>> + * created and bound to the vq.
>> + *
>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
>> + */
> 
> A couple of things here:
> it's probably a good idea not to make it match pid exactly,
> if for no other reason than I'm not sure we want to
> commit this being a pid. Let's just call it an id?

Ok.

> And maybe byteswap it or xor with some value
> just to make sure userspace does not begin abusing it anyway.
> 
> Also, interaction with pid namespace is unclear to me.
> Can you document what happens here?

This current patchset only allows the vhost_dev owner to
create/bind workers for devices it owns, so namespace don't come
into play. If a thread from another namespace tried to create/bind
a worker we would hit the owner checks in vhost_dev_ioctl which is
done before vhost_vring_ioctl normally (for vdpa we hit the use_worker
check and fail there).

However, with the kernel worker API changes the worker threads will
now be in the vhost dev owner's namespace and not the kthreadd/default
one, so in the future we are covered if we want to do something more
advanced. For example, I've seen people working on an API to export the
worker pids:

https://lore.kernel.org/netdev/20210507154332.hiblsd6ot5wzwkdj@steredhat/T/

and in the future for interfaces that export that info we could restrict
access to root or users from the same namespace or I guess add interfaces
to allow different namespaces to see the workers and share them.

> No need to fix funky things like moving the fd between
> pid namespaces while also creating/destroying workers, but let's
> document it's not supported.

Ok. I'll add a comment.

Mike Christie Oct. 22, 2021, 6:17 p.m. UTC | #3

On 10/22/21 11:12 AM, michael.christie@oracle.com wrote:
> On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
>>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>>> index c998860d7bbc..e5c0669430e5 100644
>>> --- a/include/uapi/linux/vhost.h
>>> +++ b/include/uapi/linux/vhost.h
>>> @@ -70,6 +70,17 @@
>>>  #define VHOST_VRING_BIG_ENDIAN 1
>>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
>>> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
>>> + * that its virtqueues share. This allows userspace to create a vhost_worker
>>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
>>> + *
>>> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
>>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
>>> + * created and bound to the vq.
>>> + *
>>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
>>> + */
>>
>> A couple of things here:
>> it's probably a good idea not to make it match pid exactly,
>> if for no other reason than I'm not sure we want to
>> commit this being a pid. Let's just call it an id?
> 
> Ok.
> 
>> And maybe byteswap it or xor with some value
>> just to make sure userspace does not begin abusing it anyway.
>>
>> Also, interaction with pid namespace is unclear to me.
>> Can you document what happens here?
> 
> This current patchset only allows the vhost_dev owner to
> create/bind workers for devices it owns, so namespace don't come

I made a mistake here. The patches do restrict VHOST_SET_VRING_WORKER
to the same owner like I wrote. However, it looks like we could have 2
threads with the same mm pointer so vhost_dev_check_owner returns true,
but they could be in different namespaces.

Even though we are not going to pass the pid_t between user/kernel
space, should I add a pid namespace check when I repost the patches?



> into play. If a thread from another namespace tried to create/bind
> a worker we would hit the owner checks in vhost_dev_ioctl which is
> done before vhost_vring_ioctl normally (for vdpa we hit the use_worker
> check and fail there).
> 
> However, with the kernel worker API changes the worker threads will
> now be in the vhost dev owner's namespace and not the kthreadd/default
> one, so in the future we are covered if we want to do something more
> advanced. For example, I've seen people working on an API to export the
> worker pids:
> 
> https://lore.kernel.org/netdev/20210507154332.hiblsd6ot5wzwkdj@steredhat/T/
> 
> and in the future for interfaces that export that info we could restrict
> access to root or users from the same namespace or I guess add interfaces
> to allow different namespaces to see the workers and share them.
> 
> 
>> No need to fix funky things like moving the fd between
>> pid namespaces while also creating/destroying workers, but let's
>> document it's not supported.
> 
> Ok. I'll add a comment.
>

Michael S. Tsirkin Oct. 23, 2021, 8:11 p.m. UTC | #4

On Fri, Oct 22, 2021 at 01:17:26PM -0500, michael.christie@oracle.com wrote:
> On 10/22/21 11:12 AM, michael.christie@oracle.com wrote:
> > On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
> >>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> >>> index c998860d7bbc..e5c0669430e5 100644
> >>> --- a/include/uapi/linux/vhost.h
> >>> +++ b/include/uapi/linux/vhost.h
> >>> @@ -70,6 +70,17 @@
> >>>  #define VHOST_VRING_BIG_ENDIAN 1
> >>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
> >>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> >>> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> >>> + * that its virtqueues share. This allows userspace to create a vhost_worker
> >>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> >>> + *
> >>> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> >>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> >>> + * created and bound to the vq.
> >>> + *
> >>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> >>> + */
> >>
> >> A couple of things here:
> >> it's probably a good idea not to make it match pid exactly,
> >> if for no other reason than I'm not sure we want to
> >> commit this being a pid. Let's just call it an id?
> > 
> > Ok.
> > 
> >> And maybe byteswap it or xor with some value
> >> just to make sure userspace does not begin abusing it anyway.
> >>
> >> Also, interaction with pid namespace is unclear to me.
> >> Can you document what happens here?
> > 
> > This current patchset only allows the vhost_dev owner to
> > create/bind workers for devices it owns, so namespace don't come
> 
> I made a mistake here. The patches do restrict VHOST_SET_VRING_WORKER
> to the same owner like I wrote. However, it looks like we could have 2
> threads with the same mm pointer so vhost_dev_check_owner returns true,
> but they could be in different namespaces.
> 
> Even though we are not going to pass the pid_t between user/kernel
> space, should I add a pid namespace check when I repost the patches?

Um it's part of the ioctl. How you are not going to pass it around?

So if we do worry about this, I would just make it a 64 bit integer,
rename it "id" and increment each time a thread is created.
 
> 
> > into play. If a thread from another namespace tried to create/bind
> > a worker we would hit the owner checks in vhost_dev_ioctl which is
> > done before vhost_vring_ioctl normally (for vdpa we hit the use_worker
> > check and fail there).
> > 
> > However, with the kernel worker API changes the worker threads will
> > now be in the vhost dev owner's namespace and not the kthreadd/default
> > one, so in the future we are covered if we want to do something more
> > advanced. For example, I've seen people working on an API to export the
> > worker pids:
> > 
> > https://lore.kernel.org/netdev/20210507154332.hiblsd6ot5wzwkdj@steredhat/T/
> > 
> > and in the future for interfaces that export that info we could restrict
> > access to root or users from the same namespace or I guess add interfaces
> > to allow different namespaces to see the workers and share them.
> > 
> > 
> >> No need to fix funky things like moving the fd between
> >> pid namespaces while also creating/destroying workers, but let's
> >> document it's not supported.
> > 
> > Ok. I'll add a comment.
> >

Mike Christie Oct. 25, 2021, 4:04 p.m. UTC | #5

On 10/23/21 3:11 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 22, 2021 at 01:17:26PM -0500, michael.christie@oracle.com wrote:
>> On 10/22/21 11:12 AM, michael.christie@oracle.com wrote:
>>> On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
>>>>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>>>>> index c998860d7bbc..e5c0669430e5 100644
>>>>> --- a/include/uapi/linux/vhost.h
>>>>> +++ b/include/uapi/linux/vhost.h
>>>>> @@ -70,6 +70,17 @@
>>>>>  #define VHOST_VRING_BIG_ENDIAN 1
>>>>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>>>>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
>>>>> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
>>>>> + * that its virtqueues share. This allows userspace to create a vhost_worker
>>>>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
>>>>> + *
>>>>> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
>>>>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
>>>>> + * created and bound to the vq.
>>>>> + *
>>>>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
>>>>> + */
>>>>
>>>> A couple of things here:
>>>> it's probably a good idea not to make it match pid exactly,
>>>> if for no other reason than I'm not sure we want to
>>>> commit this being a pid. Let's just call it an id?
>>>
>>> Ok.
>>>
>>>> And maybe byteswap it or xor with some value
>>>> just to make sure userspace does not begin abusing it anyway.
>>>>
>>>> Also, interaction with pid namespace is unclear to me.
>>>> Can you document what happens here?
>>>
>>> This current patchset only allows the vhost_dev owner to
>>> create/bind workers for devices it owns, so namespace don't come
>>
>> I made a mistake here. The patches do restrict VHOST_SET_VRING_WORKER
>> to the same owner like I wrote. However, it looks like we could have 2
>> threads with the same mm pointer so vhost_dev_check_owner returns true,
>> but they could be in different namespaces.
>>
>> Even though we are not going to pass the pid_t between user/kernel
>> space, should I add a pid namespace check when I repost the patches?
> 
> Um it's part of the ioctl. How you are not going to pass it around?

The not passing a pid around was referring to your comment about
obfuscating the pid. I might have misunderstood you and thought you
wanted to do something more like you suggested below where to userspace
it's just some int as far as userspace knows.


> 
> So if we do worry about this, I would just make it a 64 bit integer,
> rename it "id" and increment each time a thread is created.
>  
Yeah, this works for me. I just used a ida to allocate the id. We can 
then use it's lookup functions too.

Michael S. Tsirkin Oct. 25, 2021, 5:14 p.m. UTC | #6

On Mon, Oct 25, 2021 at 11:04:42AM -0500, michael.christie@oracle.com wrote:
> On 10/23/21 3:11 PM, Michael S. Tsirkin wrote:
> > On Fri, Oct 22, 2021 at 01:17:26PM -0500, michael.christie@oracle.com wrote:
> >> On 10/22/21 11:12 AM, michael.christie@oracle.com wrote:
> >>> On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
> >>>>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> >>>>> index c998860d7bbc..e5c0669430e5 100644
> >>>>> --- a/include/uapi/linux/vhost.h
> >>>>> +++ b/include/uapi/linux/vhost.h
> >>>>> @@ -70,6 +70,17 @@
> >>>>>  #define VHOST_VRING_BIG_ENDIAN 1
> >>>>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
> >>>>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> >>>>> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> >>>>> + * that its virtqueues share. This allows userspace to create a vhost_worker
> >>>>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> >>>>> + *
> >>>>> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> >>>>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> >>>>> + * created and bound to the vq.
> >>>>> + *
> >>>>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> >>>>> + */
> >>>>
> >>>> A couple of things here:
> >>>> it's probably a good idea not to make it match pid exactly,
> >>>> if for no other reason than I'm not sure we want to
> >>>> commit this being a pid. Let's just call it an id?
> >>>
> >>> Ok.
> >>>
> >>>> And maybe byteswap it or xor with some value
> >>>> just to make sure userspace does not begin abusing it anyway.
> >>>>
> >>>> Also, interaction with pid namespace is unclear to me.
> >>>> Can you document what happens here?
> >>>
> >>> This current patchset only allows the vhost_dev owner to
> >>> create/bind workers for devices it owns, so namespace don't come
> >>
> >> I made a mistake here. The patches do restrict VHOST_SET_VRING_WORKER
> >> to the same owner like I wrote. However, it looks like we could have 2
> >> threads with the same mm pointer so vhost_dev_check_owner returns true,
> >> but they could be in different namespaces.
> >>
> >> Even though we are not going to pass the pid_t between user/kernel
> >> space, should I add a pid namespace check when I repost the patches?
> > 
> > Um it's part of the ioctl. How you are not going to pass it around?
> 
> The not passing a pid around was referring to your comment about
> obfuscating the pid. I might have misunderstood you and thought you
> wanted to do something more like you suggested below where to userspace
> it's just some int as far as userspace knows.
> 
> 
> > 
> > So if we do worry about this, I would just make it a 64 bit integer,
> > rename it "id" and increment each time a thread is created.
> >  
> Yeah, this works for me. I just used a ida to allocate the id. We can 
> then use it's lookup functions too.

Probably for the best, linear lookups will make destroying lots of
threads and O(N^2) operation.

Jason Wang Oct. 26, 2021, 5:37 a.m. UTC | #7

在 2021/10/22 下午1:19, Mike Christie 写道:
> This patch allows userspace to create workers and bind them to vqs. You
> can have N workers per dev and also share N workers with M vqs.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>


A question, who is the best one to determine the binding? Is it the VMM 
(Qemu etc) or the management stack? If the latter, it looks to me it's 
better to expose this via sysfs?


> ---
>   drivers/vhost/vhost.c            | 99 ++++++++++++++++++++++++++++----
>   drivers/vhost/vhost.h            |  2 +-
>   include/uapi/linux/vhost.h       | 11 ++++
>   include/uapi/linux/vhost_types.h | 12 ++++
>   4 files changed, 112 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 04f43a6445e1..c86e88d7f35c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -493,7 +493,6 @@ void vhost_dev_init(struct vhost_dev *dev,
>   	dev->umem = NULL;
>   	dev->iotlb = NULL;
>   	dev->mm = NULL;
> -	dev->worker = NULL;
>   	dev->iov_limit = iov_limit;
>   	dev->weight = weight;
>   	dev->byte_weight = byte_weight;
> @@ -576,20 +575,40 @@ static void vhost_worker_stop(struct vhost_worker *worker)
>   	wait_for_completion(worker->exit_done);
>   }
>   
> -static void vhost_worker_free(struct vhost_dev *dev)
> -{
> -	struct vhost_worker *worker = dev->worker;
>   
> +static void vhost_worker_put(struct vhost_worker *worker)
> +{
>   	if (!worker)
>   		return;
>   
> -	dev->worker = NULL;
> +	if (!refcount_dec_and_test(&worker->refcount))
> +		return;
> +
>   	WARN_ON(!llist_empty(&worker->work_list));
>   	vhost_worker_stop(worker);
>   	kfree(worker);
>   }
>   
> -static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> +static void vhost_vq_clear_worker(struct vhost_virtqueue *vq)
> +{
> +	if (vq->worker)
> +		vhost_worker_put(vq->worker);
> +	vq->worker = NULL;
> +}
> +
> +static void vhost_workers_free(struct vhost_dev *dev)
> +{
> +	int i;
> +
> +	if (!dev->use_worker)
> +		return;
> +
> +	for (i = 0; i < dev->nvqs; i++)
> +		vhost_vq_clear_worker(dev->vqs[i]);
> +}
> +
> +static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev,
> +						int init_vq_map_count)
>   {
>   	struct vhost_worker *worker;
>   	struct task_struct *task;
> @@ -598,9 +617,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>   	if (!worker)
>   		return NULL;
>   
> -	dev->worker = worker;
>   	worker->kcov_handle = kcov_common_handle();
>   	init_llist_head(&worker->work_list);
> +	refcount_set(&worker->refcount, init_vq_map_count);
>   
>   	/*
>   	 * vhost used to use the kthread API which ignores all signals by
> @@ -617,10 +636,58 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>   
>   free_worker:
>   	kfree(worker);
> -	dev->worker = NULL;
>   	return NULL;
>   }
>   
> +static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
> +{
> +	struct vhost_worker *worker = NULL;
> +	int i;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		if (dev->vqs[i]->worker->task->pid != pid)
> +			continue;
> +
> +		worker = dev->vqs[i]->worker;
> +		break;
> +	}
> +
> +	return worker;
> +}
> +
> +/* Caller must have device mutex */
> +static int vhost_vq_setup_worker(struct vhost_virtqueue *vq,
> +				 struct vhost_vring_worker *info)
> +{
> +	struct vhost_dev *dev = vq->dev;
> +	struct vhost_worker *worker;
> +
> +	if (!dev->use_worker)
> +		return -EINVAL;
> +
> +	/* We don't support setting a worker on an active vq */
> +	if (vq->private_data)
> +		return -EBUSY;


Is it valuable to allow the worker switching on active vq?


> +
> +	if (info->pid == VHOST_VRING_NEW_WORKER) {
> +		worker = vhost_worker_create(dev, 1);
> +		if (!worker)
> +			return -ENOMEM;
> +
> +		info->pid = worker->task->pid;
> +	} else {
> +		worker = vhost_worker_find(dev, info->pid);
> +		if (!worker)
> +			return -ENODEV;
> +
> +		refcount_inc(&worker->refcount);
> +	}
> +
> +	vhost_vq_clear_worker(vq);
> +	vq->worker = worker;
> +	return 0;
> +}
> +
>   /* Caller should have device mutex */
>   long vhost_dev_set_owner(struct vhost_dev *dev)
>   {
> @@ -636,7 +703,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>   	vhost_attach_mm(dev);
>   
>   	if (dev->use_worker) {
> -		worker = vhost_worker_create(dev);
> +		worker = vhost_worker_create(dev, dev->nvqs);
>   		if (!worker)
>   			goto err_worker;
>   
> @@ -650,7 +717,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>   
>   	return 0;
>   err_iovecs:
> -	vhost_worker_free(dev);
> +	vhost_workers_free(dev);
>   err_worker:
>   	vhost_detach_mm(dev);
>   err_mm:
> @@ -742,7 +809,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>   	dev->iotlb = NULL;
>   	vhost_clear_msg(dev);
>   	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
> -	vhost_worker_free(dev);
> +	vhost_workers_free(dev);
>   	vhost_detach_mm(dev);
>   }
>   EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -1612,6 +1679,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>   	struct eventfd_ctx *ctx = NULL;
>   	u32 __user *idxp = argp;
>   	struct vhost_virtqueue *vq;
> +	struct vhost_vring_worker w;
>   	struct vhost_vring_state s;
>   	struct vhost_vring_file f;
>   	u32 idx;
> @@ -1719,6 +1787,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>   		if (copy_to_user(argp, &s, sizeof(s)))
>   			r = -EFAULT;
>   		break;
> +	case VHOST_SET_VRING_WORKER:
> +		if (copy_from_user(&w, argp, sizeof(w))) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		r = vhost_vq_setup_worker(vq, &w);
> +		if (!r && copy_to_user(argp, &w, sizeof(w)))
> +			r = -EFAULT;
> +		break;
>   	default:
>   		r = -ENOIOCTLCMD;
>   	}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 33c63b24187a..0911d1a9bd3b 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -35,6 +35,7 @@ struct vhost_worker {
>   	struct llist_head	work_list;
>   	u64			kcov_handle;
>   	unsigned long		flags;
> +	refcount_t		refcount;
>   };
>   
>   /* Poll a file (eventfd or socket) */
> @@ -160,7 +161,6 @@ struct vhost_dev {
>   	struct vhost_virtqueue **vqs;
>   	int nvqs;
>   	struct eventfd_ctx *log_ctx;
> -	struct vhost_worker *worker;
>   	struct vhost_iotlb *umem;
>   	struct vhost_iotlb *iotlb;
>   	spinlock_t iotlb_lock;
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index c998860d7bbc..e5c0669430e5 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -70,6 +70,17 @@
>   #define VHOST_VRING_BIG_ENDIAN 1
>   #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>   #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> + * that its virtqueues share. This allows userspace to create a vhost_worker
> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> + *
> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> + * created and bound to the vq.
> + *
> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> + */
> +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
>   
>   /* The following ioctls use eventfd file descriptors to signal and poll
>    * for events. */
> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> index f7f6a3a28977..af654e3cef0e 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -47,6 +47,18 @@ struct vhost_vring_addr {
>   	__u64 log_guest_addr;
>   };
>   
> +#define VHOST_VRING_NEW_WORKER -1


Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated 
ioctls are better:

VHOST_VRING_NEW/FREE_WORKER
VHOST_VRING_ATTACH_WORKER

etc.

Thanks


> +
> +struct vhost_vring_worker {
> +	unsigned int index;
> +	/*
> +	 * The pid of the vhost worker that the vq will be bound to. If
> +	 * pid is VHOST_VRING_NEW_WORKER a new worker will be created and its
> +	 * pid will be returned in pid.
> +	 */
> +	__kernel_pid_t pid;
> +};
> +
>   /* no alignment requirement */
>   struct vhost_iotlb_msg {
>   	__u64 iova;

Michael S. Tsirkin Oct. 26, 2021, 1:09 p.m. UTC | #8

On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> 
> 在 2021/10/22 下午1:19, Mike Christie 写道:
> > This patch allows userspace to create workers and bind them to vqs. You
> > can have N workers per dev and also share N workers with M vqs.
> > 
> > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> 
> 
> A question, who is the best one to determine the binding? Is it the VMM
> (Qemu etc) or the management stack? If the latter, it looks to me it's
> better to expose this via sysfs?

I think it's a bit much to expect this from management.

> 
> > ---
> >   drivers/vhost/vhost.c            | 99 ++++++++++++++++++++++++++++----
> >   drivers/vhost/vhost.h            |  2 +-
> >   include/uapi/linux/vhost.h       | 11 ++++
> >   include/uapi/linux/vhost_types.h | 12 ++++
> >   4 files changed, 112 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 04f43a6445e1..c86e88d7f35c 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -493,7 +493,6 @@ void vhost_dev_init(struct vhost_dev *dev,
> >   	dev->umem = NULL;
> >   	dev->iotlb = NULL;
> >   	dev->mm = NULL;
> > -	dev->worker = NULL;
> >   	dev->iov_limit = iov_limit;
> >   	dev->weight = weight;
> >   	dev->byte_weight = byte_weight;
> > @@ -576,20 +575,40 @@ static void vhost_worker_stop(struct vhost_worker *worker)
> >   	wait_for_completion(worker->exit_done);
> >   }
> > -static void vhost_worker_free(struct vhost_dev *dev)
> > -{
> > -	struct vhost_worker *worker = dev->worker;
> > +static void vhost_worker_put(struct vhost_worker *worker)
> > +{
> >   	if (!worker)
> >   		return;
> > -	dev->worker = NULL;
> > +	if (!refcount_dec_and_test(&worker->refcount))
> > +		return;
> > +
> >   	WARN_ON(!llist_empty(&worker->work_list));
> >   	vhost_worker_stop(worker);
> >   	kfree(worker);
> >   }
> > -static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> > +static void vhost_vq_clear_worker(struct vhost_virtqueue *vq)
> > +{
> > +	if (vq->worker)
> > +		vhost_worker_put(vq->worker);
> > +	vq->worker = NULL;
> > +}
> > +
> > +static void vhost_workers_free(struct vhost_dev *dev)
> > +{
> > +	int i;
> > +
> > +	if (!dev->use_worker)
> > +		return;
> > +
> > +	for (i = 0; i < dev->nvqs; i++)
> > +		vhost_vq_clear_worker(dev->vqs[i]);
> > +}
> > +
> > +static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev,
> > +						int init_vq_map_count)
> >   {
> >   	struct vhost_worker *worker;
> >   	struct task_struct *task;
> > @@ -598,9 +617,9 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> >   	if (!worker)
> >   		return NULL;
> > -	dev->worker = worker;
> >   	worker->kcov_handle = kcov_common_handle();
> >   	init_llist_head(&worker->work_list);
> > +	refcount_set(&worker->refcount, init_vq_map_count);
> >   	/*
> >   	 * vhost used to use the kthread API which ignores all signals by
> > @@ -617,10 +636,58 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
> >   free_worker:
> >   	kfree(worker);
> > -	dev->worker = NULL;
> >   	return NULL;
> >   }
> > +static struct vhost_worker *vhost_worker_find(struct vhost_dev *dev, pid_t pid)
> > +{
> > +	struct vhost_worker *worker = NULL;
> > +	int i;
> > +
> > +	for (i = 0; i < dev->nvqs; i++) {
> > +		if (dev->vqs[i]->worker->task->pid != pid)
> > +			continue;
> > +
> > +		worker = dev->vqs[i]->worker;
> > +		break;
> > +	}
> > +
> > +	return worker;
> > +}
> > +
> > +/* Caller must have device mutex */
> > +static int vhost_vq_setup_worker(struct vhost_virtqueue *vq,
> > +				 struct vhost_vring_worker *info)
> > +{
> > +	struct vhost_dev *dev = vq->dev;
> > +	struct vhost_worker *worker;
> > +
> > +	if (!dev->use_worker)
> > +		return -EINVAL;
> > +
> > +	/* We don't support setting a worker on an active vq */
> > +	if (vq->private_data)
> > +		return -EBUSY;
> 
> 
> Is it valuable to allow the worker switching on active vq?
> 
> 
> > +
> > +	if (info->pid == VHOST_VRING_NEW_WORKER) {
> > +		worker = vhost_worker_create(dev, 1);
> > +		if (!worker)
> > +			return -ENOMEM;
> > +
> > +		info->pid = worker->task->pid;
> > +	} else {
> > +		worker = vhost_worker_find(dev, info->pid);
> > +		if (!worker)
> > +			return -ENODEV;
> > +
> > +		refcount_inc(&worker->refcount);
> > +	}
> > +
> > +	vhost_vq_clear_worker(vq);
> > +	vq->worker = worker;
> > +	return 0;
> > +}
> > +
> >   /* Caller should have device mutex */
> >   long vhost_dev_set_owner(struct vhost_dev *dev)
> >   {
> > @@ -636,7 +703,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> >   	vhost_attach_mm(dev);
> >   	if (dev->use_worker) {
> > -		worker = vhost_worker_create(dev);
> > +		worker = vhost_worker_create(dev, dev->nvqs);
> >   		if (!worker)
> >   			goto err_worker;
> > @@ -650,7 +717,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> >   	return 0;
> >   err_iovecs:
> > -	vhost_worker_free(dev);
> > +	vhost_workers_free(dev);
> >   err_worker:
> >   	vhost_detach_mm(dev);
> >   err_mm:
> > @@ -742,7 +809,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> >   	dev->iotlb = NULL;
> >   	vhost_clear_msg(dev);
> >   	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
> > -	vhost_worker_free(dev);
> > +	vhost_workers_free(dev);
> >   	vhost_detach_mm(dev);
> >   }
> >   EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> > @@ -1612,6 +1679,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >   	struct eventfd_ctx *ctx = NULL;
> >   	u32 __user *idxp = argp;
> >   	struct vhost_virtqueue *vq;
> > +	struct vhost_vring_worker w;
> >   	struct vhost_vring_state s;
> >   	struct vhost_vring_file f;
> >   	u32 idx;
> > @@ -1719,6 +1787,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >   		if (copy_to_user(argp, &s, sizeof(s)))
> >   			r = -EFAULT;
> >   		break;
> > +	case VHOST_SET_VRING_WORKER:
> > +		if (copy_from_user(&w, argp, sizeof(w))) {
> > +			r = -EFAULT;
> > +			break;
> > +		}
> > +		r = vhost_vq_setup_worker(vq, &w);
> > +		if (!r && copy_to_user(argp, &w, sizeof(w)))
> > +			r = -EFAULT;
> > +		break;
> >   	default:
> >   		r = -ENOIOCTLCMD;
> >   	}
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 33c63b24187a..0911d1a9bd3b 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -35,6 +35,7 @@ struct vhost_worker {
> >   	struct llist_head	work_list;
> >   	u64			kcov_handle;
> >   	unsigned long		flags;
> > +	refcount_t		refcount;
> >   };
> >   /* Poll a file (eventfd or socket) */
> > @@ -160,7 +161,6 @@ struct vhost_dev {
> >   	struct vhost_virtqueue **vqs;
> >   	int nvqs;
> >   	struct eventfd_ctx *log_ctx;
> > -	struct vhost_worker *worker;
> >   	struct vhost_iotlb *umem;
> >   	struct vhost_iotlb *iotlb;
> >   	spinlock_t iotlb_lock;
> > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > index c998860d7bbc..e5c0669430e5 100644
> > --- a/include/uapi/linux/vhost.h
> > +++ b/include/uapi/linux/vhost.h
> > @@ -70,6 +70,17 @@
> >   #define VHOST_VRING_BIG_ENDIAN 1
> >   #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
> >   #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> > +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> > + * that its virtqueues share. This allows userspace to create a vhost_worker
> > + * and map a virtqueue to it or map a virtqueue to an existing worker.
> > + *
> > + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> > + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> > + * created and bound to the vq.
> > + *
> > + * This must be called after VHOST_SET_OWNER and before the vq is active.
> > + */
> > +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
> >   /* The following ioctls use eventfd file descriptors to signal and poll
> >    * for events. */
> > diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> > index f7f6a3a28977..af654e3cef0e 100644
> > --- a/include/uapi/linux/vhost_types.h
> > +++ b/include/uapi/linux/vhost_types.h
> > @@ -47,6 +47,18 @@ struct vhost_vring_addr {
> >   	__u64 log_guest_addr;
> >   };
> > +#define VHOST_VRING_NEW_WORKER -1
> 
> 
> Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated ioctls
> are better:
> 
> VHOST_VRING_NEW/FREE_WORKER
> VHOST_VRING_ATTACH_WORKER
> 
> etc.
> 
> Thanks
> 
> 
> > +
> > +struct vhost_vring_worker {
> > +	unsigned int index;
> > +	/*
> > +	 * The pid of the vhost worker that the vq will be bound to. If
> > +	 * pid is VHOST_VRING_NEW_WORKER a new worker will be created and its
> > +	 * pid will be returned in pid.
> > +	 */
> > +	__kernel_pid_t pid;
> > +};
> > +
> >   /* no alignment requirement */
> >   struct vhost_iotlb_msg {
> >   	__u64 iova;

Stefan Hajnoczi Oct. 26, 2021, 3:22 p.m. UTC | #9

On Fri, Oct 22, 2021 at 12:19:11AM -0500, Mike Christie wrote:
> +/* Caller must have device mutex */
> +static int vhost_vq_setup_worker(struct vhost_virtqueue *vq,
> +				 struct vhost_vring_worker *info)

It's clearer if the function name matches the ioctl name
(VHOST_SET_VRING_WORKER).

Stefan Hajnoczi Oct. 26, 2021, 3:24 p.m. UTC | #10

On Fri, Oct 22, 2021 at 12:19:11AM -0500, Mike Christie wrote:
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index c998860d7bbc..e5c0669430e5 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -70,6 +70,17 @@
>  #define VHOST_VRING_BIG_ENDIAN 1
>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_state)
>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
> + * that its virtqueues share. This allows userspace to create a vhost_worker
> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> + *
> + * If pid > 0 and it matches an existing vhost_worker thread it will be bound
> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> + * created and bound to the vq.
> + *
> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> + */
> +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)

Please clarify whether or not multiple devices can attach vqs to the same worker.

Stefan

Stefan Hajnoczi Oct. 26, 2021, 3:44 p.m. UTC | #11

On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> 
> 在 2021/10/22 下午1:19, Mike Christie 写道:
> > This patch allows userspace to create workers and bind them to vqs. You
> > can have N workers per dev and also share N workers with M vqs.
> > 
> > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> 
> 
> A question, who is the best one to determine the binding? Is it the VMM
> (Qemu etc) or the management stack? If the latter, it looks to me it's
> better to expose this via sysfs?

A few options that let the management stack control vhost worker CPU
affinity:

1. The management tool opens the vhost device node, calls
   ioctl(VHOST_SET_VRING_WORKER), sets up CPU affinity, and then passes
   the fd to the VMM. In this case the VMM is still able to call the
   ioctl, which may be undesirable from an attack surface perspective.

2. The VMM calls ioctl(VHOST_SET_VRING_WORKER) itself and the management
   tool queries the vq:worker details from the VMM (e.g. a new QEMU QMP
   query-vhost-workers command similar to query-iothreads). The
   management tool can then control CPU affinity on the vhost worker
   threads.

   (This is how CPU affinity works in QEMU and libvirt today.)

3. The sysfs approach you suggested. Does sysfs export vq-0/, vq-1/, etc
   directories with a "worker" attribute? Do we need to define a point
   when the VMM has set up vqs and the management stack is able to query
   them? Vhost devices currently pre-allocate the maximum number of vqs
   and I'm not sure how to determine the number of vqs that will
   actually be used?

   One advantage of this is that access to the vq:worker mapping can be
   limited to the management stack and the VMM cannot access it. But it
   seems a little tricky because the vhost model today doesn't use sysfs
   or define a lifecycle where the management stack can configure
   devices.

Stefan

Stefan Hajnoczi Oct. 26, 2021, 4:36 p.m. UTC | #12

On Tue, Oct 26, 2021 at 09:09:52AM -0400, Michael S. Tsirkin wrote:
> On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> > 
> > 在 2021/10/22 下午1:19, Mike Christie 写道:
> > > This patch allows userspace to create workers and bind them to vqs. You
> > > can have N workers per dev and also share N workers with M vqs.
> > > 
> > > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> > 
> > 
> > A question, who is the best one to determine the binding? Is it the VMM
> > (Qemu etc) or the management stack? If the latter, it looks to me it's
> > better to expose this via sysfs?
> 
> I think it's a bit much to expect this from management.

The management stack controls the number of vqs used as well as the vCPU
and IOThread CPU affinity. It seems natural for it to also control the
vhost worker CPU affinity. Where else should that be controlled?

Stefan

Mike Christie Oct. 26, 2021, 4:49 p.m. UTC | #13

On 10/26/21 12:37 AM, Jason Wang wrote:
> 
> 在 2021/10/22 下午1:19, Mike Christie 写道:
>> This patch allows userspace to create workers and bind them to vqs. You
>> can have N workers per dev and also share N workers with M vqs.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> 
> 
> A question, who is the best one to determine the binding? Is it the VMM (Qemu etc) or the management stack? If the latter, it looks to me it's better to expose this via sysfs?

I thought it would be where you have management app settings, then the
management app talks to the qemu control interface like it does when it
adds new devices on the fly.

A problem with the management app doing it is to handle the RLIMIT_NPROC
review comment, this patchset:

https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@oracle.com/

basically has the kernel do a clone() from the caller's context. So adding
a worker is like doing the VHOST_SET_OWNER ioctl where it still has to be done
from a process you can inherit values like the mm, cgroups, and now RLIMITs.

>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index f7f6a3a28977..af654e3cef0e 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -47,6 +47,18 @@ struct vhost_vring_addr {
>>       __u64 log_guest_addr;
>>   };
>>   +#define VHOST_VRING_NEW_WORKER -1
> 
> 
> Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated ioctls are better:
> 
> VHOST_VRING_NEW/FREE_WORKER
> VHOST_VRING_ATTACH_WORKER

We didn't need a free worker, because the kernel handles it for userspace. I
tried to make it easy for userspace because in some cases it may not be able
to do syscalls like close on the device. For example if qemu crashes or for
vhost-scsi we don't do an explicit close during VM shutdown.

So we start off with the default worker thread that's used by all vqs like we do
today. Userspace can then override it by creating a new worker. That also unbinds/
detaches the existing worker and does a put on the workers refcount. We also do a
put on the worker when we stop using it during device shutdown/closure/release.
When the worker's refcount goes to zero the kernel deletes it.

I think separating the calls could be helpful though.

Jason Wang Oct. 27, 2021, 2:55 a.m. UTC | #14

On Tue, Oct 26, 2021 at 11:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> >
> > 在 2021/10/22 下午1:19, Mike Christie 写道:
> > > This patch allows userspace to create workers and bind them to vqs. You
> > > can have N workers per dev and also share N workers with M vqs.
> > >
> > > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >
> >
> > A question, who is the best one to determine the binding? Is it the VMM
> > (Qemu etc) or the management stack? If the latter, it looks to me it's
> > better to expose this via sysfs?
>
> A few options that let the management stack control vhost worker CPU
> affinity:
>
> 1. The management tool opens the vhost device node, calls
>    ioctl(VHOST_SET_VRING_WORKER), sets up CPU affinity, and then passes
>    the fd to the VMM. In this case the VMM is still able to call the
>    ioctl, which may be undesirable from an attack surface perspective.

Yes, and we can't do post or dynamic configuration afterwards after
the VM is launched?

>
> 2. The VMM calls ioctl(VHOST_SET_VRING_WORKER) itself and the management
>    tool queries the vq:worker details from the VMM (e.g. a new QEMU QMP
>    query-vhost-workers command similar to query-iothreads). The
>    management tool can then control CPU affinity on the vhost worker
>    threads.
>
>    (This is how CPU affinity works in QEMU and libvirt today.)

Then we also need a "bind-vhost-workers" command.

>
> 3. The sysfs approach you suggested. Does sysfs export vq-0/, vq-1/, etc
>    directories with a "worker" attribute?

Something like this.

> Do we need to define a point
>    when the VMM has set up vqs and the management stack is able to query
>    them?

It could be the point that the vhost fd is opened.

>  Vhost devices currently pre-allocate the maximum number of vqs
>    and I'm not sure how to determine the number of vqs that will
>    actually be used?

It requires more information to be exposed. But before this, we should
allow the dynamic binding of between vq and worker.

>
>    One advantage of this is that access to the vq:worker mapping can be
>    limited to the management stack and the VMM cannot access it. But it
>    seems a little tricky because the vhost model today doesn't use sysfs
>    or define a lifecycle where the management stack can configure
>    devices.

Yes.

Thanks

>
> Stefan

Jason Wang Oct. 27, 2021, 6:02 a.m. UTC | #15

On Wed, Oct 27, 2021 at 12:49 AM <michael.christie@oracle.com> wrote:
>
> On 10/26/21 12:37 AM, Jason Wang wrote:
> >
> > 在 2021/10/22 下午1:19, Mike Christie 写道:
> >> This patch allows userspace to create workers and bind them to vqs. You
> >> can have N workers per dev and also share N workers with M vqs.
> >>
> >> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> >
> >
> > A question, who is the best one to determine the binding? Is it the VMM (Qemu etc) or the management stack? If the latter, it looks to me it's better to expose this via sysfs?
>
> I thought it would be where you have management app settings, then the
> management app talks to the qemu control interface like it does when it
> adds new devices on the fly.
>
> A problem with the management app doing it is to handle the RLIMIT_NPROC
> review comment, this patchset:
>
> https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@oracle.com/
>
> basically has the kernel do a clone() from the caller's context. So adding
> a worker is like doing the VHOST_SET_OWNER ioctl where it still has to be done
> from a process you can inherit values like the mm, cgroups, and now RLIMITs.

Right, so as Stefan suggested, we probably need new QMP commands then
management can help there. Then it can satisfy the model you described
above.

>
>
> >> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
> >> index f7f6a3a28977..af654e3cef0e 100644
> >> --- a/include/uapi/linux/vhost_types.h
> >> +++ b/include/uapi/linux/vhost_types.h
> >> @@ -47,6 +47,18 @@ struct vhost_vring_addr {
> >>       __u64 log_guest_addr;
> >>   };
> >>   +#define VHOST_VRING_NEW_WORKER -1
> >
> >
> > Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated ioctls are better:
> >
> > VHOST_VRING_NEW/FREE_WORKER
> > VHOST_VRING_ATTACH_WORKER
>
>
> We didn't need a free worker, because the kernel handles it for userspace. I
> tried to make it easy for userspace because in some cases it may not be able
> to do syscalls like close on the device. For example if qemu crashes or for
> vhost-scsi we don't do an explicit close during VM shutdown.
>

Ok, the motivation is that if in some cases (e.g the active number of
queues are changed), qemu can choose to free some resources.

> So we start off with the default worker thread that's used by all vqs like we do
> today. Userspace can then override it by creating a new worker. That also unbinds/
> detaches the existing worker and does a put on the workers refcount. We also do a
> put on the worker when we stop using it during device shutdown/closure/release.
> When the worker's refcount goes to zero the kernel deletes it.
>
> I think separating the calls could be helpful though.
>

Ok.

Thanks

Stefan Hajnoczi Oct. 27, 2021, 9:01 a.m. UTC | #16

On Wed, Oct 27, 2021 at 10:55:04AM +0800, Jason Wang wrote:
> On Tue, Oct 26, 2021 at 11:45 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Oct 26, 2021 at 01:37:14PM +0800, Jason Wang wrote:
> > >
> > > 在 2021/10/22 下午1:19, Mike Christie 写道:
> > > > This patch allows userspace to create workers and bind them to vqs. You
> > > > can have N workers per dev and also share N workers with M vqs.
> > > >
> > > > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> > >
> > >
> > > A question, who is the best one to determine the binding? Is it the VMM
> > > (Qemu etc) or the management stack? If the latter, it looks to me it's
> > > better to expose this via sysfs?
> >
> > A few options that let the management stack control vhost worker CPU
> > affinity:
> >
> > 1. The management tool opens the vhost device node, calls
> >    ioctl(VHOST_SET_VRING_WORKER), sets up CPU affinity, and then passes
> >    the fd to the VMM. In this case the VMM is still able to call the
> >    ioctl, which may be undesirable from an attack surface perspective.
> 
> Yes, and we can't do post or dynamic configuration afterwards after
> the VM is launched?

Yes, at least it's a little risky for the management stack to keep the
vhost fd open and make ioctl calls while the VMM is using it.

> >
> > 2. The VMM calls ioctl(VHOST_SET_VRING_WORKER) itself and the management
> >    tool queries the vq:worker details from the VMM (e.g. a new QEMU QMP
> >    query-vhost-workers command similar to query-iothreads). The
> >    management tool can then control CPU affinity on the vhost worker
> >    threads.
> >
> >    (This is how CPU affinity works in QEMU and libvirt today.)
> 
> Then we also need a "bind-vhost-workers" command.

The VMM doesn't but the management tool does.

Stefan

Stefan Hajnoczi Oct. 27, 2021, 9:03 a.m. UTC | #17

On Tue, Oct 26, 2021 at 11:49:37AM -0500, michael.christie@oracle.com wrote:
> On 10/26/21 12:37 AM, Jason Wang wrote:
> > Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated ioctls are better:
> > 
> > VHOST_VRING_NEW/FREE_WORKER
> > VHOST_VRING_ATTACH_WORKER
> 
> 
> We didn't need a free worker, because the kernel handles it for userspace. I
> tried to make it easy for userspace because in some cases it may not be able
> to do syscalls like close on the device. For example if qemu crashes or for
> vhost-scsi we don't do an explicit close during VM shutdown.
> 
> So we start off with the default worker thread that's used by all vqs like we do
> today. Userspace can then override it by creating a new worker. That also unbinds/
> detaches the existing worker and does a put on the workers refcount. We also do a
> put on the worker when we stop using it during device shutdown/closure/release.
> When the worker's refcount goes to zero the kernel deletes it.

Please document the worker (p)id lifetime for the ioctl. Otherwise
userspace doesn't know whether a previously created worker is still
alive.

SSTefan

[V3,11/11] vhost: allow userspace to create workers

Commit Message

Comments

Patch