diff mbox

[PATCHv3,2/2] vhost_net: a kernel-level virtio server

Message ID 20090813182931.GC6585@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michael S. Tsirkin Aug. 13, 2009, 6:29 p.m. UTC
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- support memory table and not just an offset (needed for kvm)

common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support raw socket as a backend,
which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
also configured by userspace, including vlan/mac etc.

Status:
This works for me, and I haven't see any crashes.
I have not run any benchmarks yet, compared to userspace, I expect to
see improved latency (as I save up to 4 system calls per packet) but not
bandwidth/CPU (as TSO and interrupt mitigation are not supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  429 ++++++++++++++++++++++++++++
 drivers/vhost/vhost.c      |  663 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h      |  108 +++++++
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  100 +++++++
 11 files changed, 1327 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

Comments

Arnd Bergmann Aug. 14, 2009, 11:40 a.m. UTC | #1
On Thursday 13 August 2009, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.

AFAICT, you have addressed all my comments, mostly by convincing me
that you got it right anyway ;-).

I hope this gets into 2.6.32, good work!

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Arnd Bergmann <arnd@arndb.de>

One idea though:

> +	/* Parameter checking */
> +	if (sock->sk->sk_type != SOCK_RAW) {
> +		r = -ESOCKTNOSUPPORT;
> +		goto done;
> +	}
> +
> +	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
> +			       &uaddr_len, 0);
> +	if (r)
> +		goto done;
> +
> +	if (uaddr.sa.sll_family != AF_PACKET) {
> +		r = -EPFNOSUPPORT;
> +		goto done;
> +	}

You currently limit the scope of the driver by only allowing raw packet
sockets to be passed into the network driver. In qemu, we currently support
some very similar transports:

* raw packet (not in a release yet)
* tcp connection
* UDP multicast
* tap character device
* VDE with Unix local sockets

My primary interest right now is the tap support, but I think it would
be interesting in general to allow different file descriptor types
in vhost_net_set_socket. AFAICT, there are two major differences
that we need to handle for this:

* most of the transports are sockets, tap uses a character device.
  This could be dealt with by having both a struct socket * in
  struct vhost_net *and* a struct file *, or by always keeping the
  struct file and calling vfs_readv/vfs_writev for the data transport
  in both cases.

* Each transport has a slightly different header, we have
  - raw ethernet frames (raw, udp multicast, tap)
  - 32-bit length + raw frames, possibly fragmented (tcp)
  - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
  To handle these three cases, we need either different ioctl numbers
  so that vhost_net can choose the right one, or a flags field in
  VHOST_NET_SET_SOCKET, like

  #define VHOST_NET_RAW		1
  #define VHOST_NET_LEN_HDR	2
  #define VHOST_NET_VNET_HDR	4

  struct vhost_net_socket {
	unsigned int flags;
	int fd;
  };
  #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

If both of those are addressed, we can treat vhost_net as a generic
way to do network handling in the kernel independent of the qemu
model (raw, tap, ...) for it. 

Your qemu patch would have to work differently, so instead of 

qemu -net nic,vhost=eth0

you would do the same as today with the raw packet socket extension

qemu -net nic -net raw,ifname=eth0 

Qemu could then automatically try to use vhost_net, if it's available
in the kernel, or just fall back on software vlan otherwise.
Does that make sense?

	Arnd <>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 16, 2009, 6:51 a.m. UTC | #2
On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
> On Thursday 13 August 2009, Michael S. Tsirkin wrote:
> > What it is: vhost net is a character device that can be used to reduce
> > the number of system calls involved in virtio networking.
> > Existing virtio net code is used in the guest without modification.
> 
> AFAICT, you have addressed all my comments, mostly by convincing me
> that you got it right anyway ;-).
> 
> I hope this gets into 2.6.32, good work!
> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Acked-by: Arnd Bergmann <arnd@arndb.de>
> 
> One idea though:
> 
> > +	/* Parameter checking */
> > +	if (sock->sk->sk_type != SOCK_RAW) {
> > +		r = -ESOCKTNOSUPPORT;
> > +		goto done;
> > +	}
> > +
> > +	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
> > +			       &uaddr_len, 0);
> > +	if (r)
> > +		goto done;
> > +
> > +	if (uaddr.sa.sll_family != AF_PACKET) {
> > +		r = -EPFNOSUPPORT;
> > +		goto done;
> > +	}
> 
> You currently limit the scope of the driver by only allowing raw packet
> sockets to be passed into the network driver. In qemu, we currently support
> some very similar transports:
> 
> * raw packet (not in a release yet)
> * tcp connection
> * UDP multicast
> * tap character device
> * VDE with Unix local sockets
> 
> My primary interest right now is the tap support, but I think it would
> be interesting in general to allow different file descriptor types
> in vhost_net_set_socket. AFAICT, there are two major differences
> that we need to handle for this:
> 
> * most of the transports are sockets, tap uses a character device.
>   This could be dealt with by having both a struct socket * in
>   struct vhost_net *and* a struct file *, or by always keeping the
>   struct file and calling vfs_readv/vfs_writev for the data transport
>   in both cases.

I am concerned that character devices might have weird side effects with
read/write operations and that calling them from kernel thread the way I
do might have security implications. Can't point at anything specific
though at the moment.
I wonder - can we expose the underlying socket used by tap, or will that
create complex lifetime issues?

> * Each transport has a slightly different header, we have
>   - raw ethernet frames (raw, udp multicast, tap)
>   - 32-bit length + raw frames, possibly fragmented (tcp)
>   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
>   To handle these three cases, we need either different ioctl numbers
>   so that vhost_net can choose the right one, or a flags field in
>   VHOST_NET_SET_SOCKET, like
> 
>   #define VHOST_NET_RAW		1
>   #define VHOST_NET_LEN_HDR	2
>   #define VHOST_NET_VNET_HDR	4
> 
>   struct vhost_net_socket {
> 	unsigned int flags;
> 	int fd;
>   };
>   #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)

It seems we can query the socket to find out the type, or use the
features ioctl.

> If both of those are addressed, we can treat vhost_net as a generic
> way to do network handling in the kernel independent of the qemu
> model (raw, tap, ...) for it. 
> 
> Your qemu patch would have to work differently, so instead of 
> 
> qemu -net nic,vhost=eth0
> 
> you would do the same as today with the raw packet socket extension
> 
> qemu -net nic -net raw,ifname=eth0 
> 
> Qemu could then automatically try to use vhost_net, if it's available
> in the kernel, or just fall back on software vlan otherwise.
> Does that make sense?
> 
> 	Arnd <>

I agree, long term it should be enabled automatically when possible.
Arnd Bergmann Aug. 19, 2009, 9:04 a.m. UTC | #3
On Sunday 16 August 2009, Michael S. Tsirkin wrote:
> On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
> > 
> > * most of the transports are sockets, tap uses a character device.
> >   This could be dealt with by having both a struct socket * in
> >   struct vhost_net *and* a struct file *, or by always keeping the
> >   struct file and calling vfs_readv/vfs_writev for the data transport
> >   in both cases.
> 
> I am concerned that character devices might have weird side effects with
> read/write operations and that calling them from kernel thread the way I
> do might have security implications. Can't point at anything specific
> though at the moment.

I understand your feelings about passing a chardev fd into your driver
and I agree that we need to be very careful if we want to allow it.

Maybe we could instead extend the 'splice' system call to work on a
vhost_net file descriptor. If we do that, we can put the access back
into a user thread (or two) that stays in splice indefinetely to
avoid some of the implications of kernel threads like the missing
ability to handle transfer errors in user space.

> I wonder - can we expose the underlying socket used by tap, or will that
> create complex lifetime issues?

I think this could get more messy in the long run than calling vfs_readv
on a random fd. It would mean deep internal knowledge of the tap driver
in vhost_net, which I really would prefer to avoid.

> > * Each transport has a slightly different header, we have
> >   - raw ethernet frames (raw, udp multicast, tap)
> >   - 32-bit length + raw frames, possibly fragmented (tcp)
> >   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
> >   To handle these three cases, we need either different ioctl numbers
> >   so that vhost_net can choose the right one, or a flags field in
> >   VHOST_NET_SET_SOCKET, like
> > 
> >   #define VHOST_NET_RAW		1
> >   #define VHOST_NET_LEN_HDR	2
> >   #define VHOST_NET_VNET_HDR	4
> > 
> >   struct vhost_net_socket {
> > 	unsigned int flags;
> > 	int fd;
> >   };
> >   #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
> 
> It seems we can query the socket to find out the type, 

yes, I understand that you can do that, but I still think that decision
should be left to user space. Adding a length header for TCP streams but
not for UDP is something that we would normally want to do, but IMHO
vhost_net should not need to know about this.

> or use the features ioctl.

Right, I had forgotten about that one. It's probably equivalent
to the flags I suggested, except that one allows you to set features
after starting the communication, while the other one prevents
you from doing that.

> > Qemu could then automatically try to use vhost_net, if it's available
> > in the kernel, or just fall back on software vlan otherwise.
> > Does that make sense?
> 
> I agree, long term it should be enabled automatically when possible.

So how about making the qemu command line interface an extension to
what Or Gerlitz has done for the raw packet sockets?

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 19, 2009, 1:04 p.m. UTC | #4
On Wed, Aug 19, 2009 at 11:04:50AM +0200, Arnd Bergmann wrote:
> On Sunday 16 August 2009, Michael S. Tsirkin wrote:
> > On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
> > > 
> > > * most of the transports are sockets, tap uses a character device.
> > >   This could be dealt with by having both a struct socket * in
> > >   struct vhost_net *and* a struct file *, or by always keeping the
> > >   struct file and calling vfs_readv/vfs_writev for the data transport
> > >   in both cases.
> > 
> > I am concerned that character devices might have weird side effects with
> > read/write operations and that calling them from kernel thread the way I
> > do might have security implications. Can't point at anything specific
> > though at the moment.
> 
> I understand your feelings about passing a chardev fd into your driver
> and I agree that we need to be very careful if we want to allow it.
> 
> Maybe we could instead extend the 'splice' system call to work on a
> vhost_net file descriptor.  If we do that, we can put the access back
> into a user thread (or two) that stays in splice indefinetely

An issue with exposing internal threading model to userspace
in this way is that we lose control of e.g. CPU locality -
and it is very hard for userspace to get it right.

> to
> avoid some of the implications of kernel threads like the missing
> ability to handle transfer errors in user space.

Are you talking about TCP here?
Transfer errors are typically asynchronous - possibly eventfd
as I expose for vhost net is sufficient there.

> > I wonder - can we expose the underlying socket used by tap, or will that
> > create complex lifetime issues?
> 
> I think this could get more messy in the long run than calling vfs_readv
> on a random fd. It would mean deep internal knowledge of the tap driver
> in vhost_net, which I really would prefer to avoid.

No, what I had in mind is adding a GET_SOCKET ioctl to tap.
vhost would then just use the socket.

> > > * Each transport has a slightly different header, we have
> > >   - raw ethernet frames (raw, udp multicast, tap)
> > >   - 32-bit length + raw frames, possibly fragmented (tcp)
> > >   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
> > >   To handle these three cases, we need either different ioctl numbers
> > >   so that vhost_net can choose the right one, or a flags field in
> > >   VHOST_NET_SET_SOCKET, like
> > > 
> > >   #define VHOST_NET_RAW		1
> > >   #define VHOST_NET_LEN_HDR	2
> > >   #define VHOST_NET_VNET_HDR	4
> > > 
> > >   struct vhost_net_socket {
> > > 	unsigned int flags;
> > > 	int fd;
> > >   };
> > >   #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
> > 
> > It seems we can query the socket to find out the type, 
> 
> yes, I understand that you can do that, but I still think that decision
> should be left to user space. Adding a length header for TCP streams but
> not for UDP is something that we would normally want to do, but IMHO
> vhost_net should not need to know about this.
> 
> > or use the features ioctl.
> 
> Right, I had forgotten about that one. It's probably equivalent
> to the flags I suggested, except that one allows you to set features
> after starting the communication, while the other one prevents
> you from doing that.
> 
> > > Qemu could then automatically try to use vhost_net, if it's available
> > > in the kernel, or just fall back on software vlan otherwise.
> > > Does that make sense?
> > 
> > I agree, long term it should be enabled automatically when possible.
> 
> So how about making the qemu command line interface an extension to
> what Or Gerlitz has done for the raw packet sockets?
> 
> 	Arnd <><

Not sure I see the connection, but I have not thought about qemu
side of things too much yet - trying to get kernel bits in place
first so that there's a stable ABI to work with.
Arnd Bergmann Aug. 19, 2009, 1:46 p.m. UTC | #5
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > Maybe we could instead extend the 'splice' system call to work on a
> > vhost_net file descriptor.  If we do that, we can put the access back
> > into a user thread (or two) that stays in splice indefinetely
> 
> An issue with exposing internal threading model to userspace
> in this way is that we lose control of e.g. CPU locality -
> and it is very hard for userspace to get it right.

Good point, I hadn't thought about that in this context.

For macvtap, my idea was to open the same tap char device multiple
times and use each fd exclusively on one *guest* CPU. I'm not sure
if virtio-net can already handle SMP guests efficiently. We might
actually need to extend it to have more pairs of virtqueues, one
for each guest CPU, which can then be bound to a host queue (or queue
pair) in the physical nic.

Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
VHOST_SET_OWNER, VHOST_RESET_OWNER and your kernel thread with a new
VHOST_NET_SPLICE blocking ioctl that does all the transfers in the
context of the calling thread.

This would improve the driver on various fronts:

- no need for playing tricks with use_mm/unuse_mm
- possibly fewer global TLB flushes from switch_mm, which
  may improve performance.
- ability to pass down error codes from socket or guest to
  user space by returning from ioctl
- based on that, the ability to use any kind of file
  descriptor that can do writev/readv or sendmsg/recvmsg
  without the nastiness you mentioned.

The disadvantage of course is that you need to add a user
thread for each guest device to make up for the workqueue
that you save.

> > to
> > avoid some of the implications of kernel threads like the missing
> > ability to handle transfer errors in user space.
> 
> Are you talking about TCP here?
> Transfer errors are typically asynchronous - possibly eventfd
> as I expose for vhost net is sufficient there.

I mean errors in general if we allow random file descriptors to be used.
E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS,
EIO, EAGAIN and possibly others. We can handle some in kernel, others
should never happen with vhost_net, but if something unexpected happens
it would be nice to just bail out to user space.

> > > I wonder - can we expose the underlying socket used by tap, or will that
> > > create complex lifetime issues?
> > 
> > I think this could get more messy in the long run than calling vfs_readv
> > on a random fd. It would mean deep internal knowledge of the tap driver
> > in vhost_net, which I really would prefer to avoid.
> 
> No, what I had in mind is adding a GET_SOCKET ioctl to tap.
> vhost would then just use the socket.

Right, that would work with tun/tap at least. It sounds a bit fishy
but I can't see a reason why it would be hard to do.
I'd have to think about how to get it working with macvtap, or if
there is much value left in macvtap after that anyway.

> > So how about making the qemu command line interface an extension to
> > what Or Gerlitz has done for the raw packet sockets?
>
> Not sure I see the connection, but I have not thought about qemu
> side of things too much yet - trying to get kernel bits in place
> first so that there's a stable ABI to work with.

Ok, fair enough. The kernel bits are obviously more time critical
right now, since they should get into 2.6.32.

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 19, 2009, 2:20 p.m. UTC | #6
On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > Maybe we could instead extend the 'splice' system call to work on a
> > > vhost_net file descriptor.  If we do that, we can put the access back
> > > into a user thread (or two) that stays in splice indefinetely
> > 
> > An issue with exposing internal threading model to userspace
> > in this way is that we lose control of e.g. CPU locality -
> > and it is very hard for userspace to get it right.
> 
> Good point, I hadn't thought about that in this context.
> 
> For macvtap, my idea was to open the same tap char device multiple
> times and use each fd exclusively on one *guest* CPU. I'm not sure
> if virtio-net can already handle SMP guests efficiently. We might
> actually need to extend it to have more pairs of virtqueues, one
> for each guest CPU, which can then be bound to a host queue (or queue
> pair) in the physical nic.
> 
> Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> VHOST_SET_OWNER, VHOST_RESET_OWNER

SET/RESET OWNER is still needed: otherwise if you share a descriptor
with another process, it can corrupt your memory.

> and your kernel thread with a new
> VHOST_NET_SPLICE blocking ioctl that does all the transfers in the
> context of the calling thread.

For one, you'd want a thread per virtqueue.  Second, an incoming traffic
might arrive on another CPU, we want to keep it local.  I guess you
would also want ioctls to wake up the threads spuriously ...

> This would improve the driver on various fronts:
> 
> - no need for playing tricks with use_mm/unuse_mm
> - possibly fewer global TLB flushes from switch_mm, which
>   may improve performance.

Why would there be less flushes?

> - ability to pass down error codes from socket or guest to
>   user space by returning from ioctl

virtio can not pass error codes. translation errors are
simple enough to just have a counter.

> - based on that, the ability to use any kind of file
>   descriptor that can do writev/readv or sendmsg/recvmsg
>   without the nastiness you mentioned.

Yes, it's an interesting approach. As I said, need to tread very
carefully though, I don't think all issues are figured out. For example:
what happens if we pass our own fd here? Will refcount on file ever get
to 0 on exit?  There may be others ...

> The disadvantage of course is that you need to add a user
> thread for each guest device to make up for the workqueue
> that you save.

More importantly, you lose control of CPU locality.  Simply put, a
natural threading model in virtualization is one thread per guest vcpu.
Asking applications to add multiple helper threads just so they can
block forever is wrong, IMO, as userspace has no idea which CPU
they should be on, what priority to use, etc.


> > > to
> > > avoid some of the implications of kernel threads like the missing
> > > ability to handle transfer errors in user space.
> > 
> > Are you talking about TCP here?
> > Transfer errors are typically asynchronous - possibly eventfd
> > as I expose for vhost net is sufficient there.
> 
> I mean errors in general if we allow random file descriptors to be used.
> E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS,
> EIO, EAGAIN and possibly others. We can handle some in kernel, others
> should never happen with vhost_net, but if something unexpected happens
> it would be nice to just bail out to user space.

And note that there might be more than one error.  I guess, that's
another problem with trying to layer on top of vfs.

> > > > I wonder - can we expose the underlying socket used by tap, or will that
> > > > create complex lifetime issues?
> > > 
> > > I think this could get more messy in the long run than calling vfs_readv
> > > on a random fd. It would mean deep internal knowledge of the tap driver
> > > in vhost_net, which I really would prefer to avoid.
> > 
> > No, what I had in mind is adding a GET_SOCKET ioctl to tap.
> > vhost would then just use the socket.
> 
> Right, that would work with tun/tap at least. It sounds a bit fishy
> but I can't see a reason why it would be hard to do.
> I'd have to think about how to get it working with macvtap, or if
> there is much value left in macvtap after that anyway.
> 
> > > So how about making the qemu command line interface an extension to
> > > what Or Gerlitz has done for the raw packet sockets?
> >
> > Not sure I see the connection, but I have not thought about qemu
> > side of things too much yet - trying to get kernel bits in place
> > first so that there's a stable ABI to work with.
> 
> Ok, fair enough. The kernel bits are obviously more time critical
> right now, since they should get into 2.6.32.
> 
> 	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Aug. 19, 2009, 3:27 p.m. UTC | #7
On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> >
> > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > VHOST_SET_OWNER, VHOST_RESET_OWNER
> 
> SET/RESET OWNER is still needed: otherwise if you share a descriptor
> with another process, it can corrupt your memory.

How? The point of using user threads is that you only ever access the
address space of the thread that called the ioctl.

> > and your kernel thread with a new
> > VHOST_NET_SPLICE blocking ioctl that does all the transfers in the
> > context of the calling thread.
> 
> For one, you'd want a thread per virtqueue.

Don't understand why. The thread can be blocked on all four ends
of the device and wake up whenever there is some work do to in
any direction. If we have data to be transferred in both ways,
we save one thread switch. It probably won't hurt much to have one
thread per direction, but I don't see the point.

> Second, an incoming traffic might arrive on another CPU, we want
> to keep it local.  I guess you would also want ioctls to wake up
> the threads spuriously ...

Outbound traffic should just stay on whatever CPU was sending it
from the guest. For inbound traffic, we should only wake up
the thread on the CPU that got the data to start with.

Why would I wake up the threads spuriously? Do you mean for
stopping the transmission or something else? I guess a pthread_kill
would be enough for shutting it down.

> > This would improve the driver on various fronts:
> > 
> > - no need for playing tricks with use_mm/unuse_mm
> > - possibly fewer global TLB flushes from switch_mm, which
> >   may improve performance.
> 
> Why would there be less flushes?

I just read up on task->active_mm handling. There probably wouldn't
be any. I got that wrong.

> > - based on that, the ability to use any kind of file
> >   descriptor that can do writev/readv or sendmsg/recvmsg
> >   without the nastiness you mentioned.
> 
> Yes, it's an interesting approach. As I said, need to tread very
> carefully though, I don't think all issues are figured out. For example:
> what happens if we pass our own fd here? Will refcount on file ever get
> to 0 on exit?  There may be others ...

right.

> > The disadvantage of course is that you need to add a user
> > thread for each guest device to make up for the workqueue
> > that you save.
> 
> More importantly, you lose control of CPU locality.  Simply put, a
> natural threading model in virtualization is one thread per guest vcpu.
> Asking applications to add multiple helper threads just so they can
> block forever is wrong, IMO, as userspace has no idea which CPU
> they should be on, what priority to use, etc.

But the kernel also doesn't know this, you get the same problem in
another way. If you have multiple guests running at different priorities,
the kernel will use those priorities to do the more important transfers
first, while with a global workqueue every guest gets the same priority.

You say that the natural model is to have one thread per guest
CPU, but you have a thread per host CPU instead. If the numbers
are different, you probably lose either way. It gets worse if you
try to apply NUMA policies.
 
> > > > to
> > > > avoid some of the implications of kernel threads like the missing
> > > > ability to handle transfer errors in user space.
> > > 
> > > Are you talking about TCP here?
> > > Transfer errors are typically asynchronous - possibly eventfd
> > > as I expose for vhost net is sufficient there.
> > 
> > I mean errors in general if we allow random file descriptors to be used.
> > E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS,
> > EIO, EAGAIN and possibly others. We can handle some in kernel, others
> > should never happen with vhost_net, but if something unexpected happens
> > it would be nice to just bail out to user space.
> 
> And note that there might be more than one error.  I guess, that's
> another problem with trying to layer on top of vfs.

Why is that different from any other system call? We just return when
we hit the first error condition.

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 20, 2009, 8:31 a.m. UTC | #8
On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
> On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > >
> > > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > > VHOST_SET_OWNER, VHOST_RESET_OWNER
> > 
> > SET/RESET OWNER is still needed: otherwise if you share a descriptor
> > with another process, it can corrupt your memory.
> 
> How? The point of using user threads is that you only ever access the
> address space of the thread that called the ioctl.

Think about this example with processes A and B sharing an fd:
A does SET_USED_ADDRESS
B does SET_USED_ADDRESS
A does VHOST_NET_SPLICE
See how stuff gets written into a random place in memory of A?

> > > and your kernel thread with a new
> > > VHOST_NET_SPLICE blocking ioctl that does all the transfers in the
> > > context of the calling thread.
> > 
> > For one, you'd want a thread per virtqueue.
> 
> Don't understand why. The thread can be blocked on all four ends
> of the device and wake up whenever there is some work do to in
> any direction. If we have data to be transferred in both ways,
> we save one thread switch. It probably won't hurt much to have one
> thread per direction, but I don't see the point.
> > Second, an incoming traffic might arrive on another CPU, we want
> > to keep it local.  I guess you would also want ioctls to wake up
> > the threads spuriously ...
> 
> Outbound traffic should just stay on whatever CPU was sending it
> from the guest. For inbound traffic, we should only wake up
> the thread on the CPU that got the data to start with.

Exactly. Since we have RX and TX virtqueues, this would mean thread per
direction or virtqueue, same thing. OTOH if RX and TX happen to run on the
same CPU, thread switch would be just a waste of time: but userspace
does not know it, only guest and kernel know.

> Why would I wake up the threads spuriously? Do you mean for
> stopping the transmission or something else? I guess a pthread_kill
> would be enough for shutting it down.

If you kill and restart them you lost priority etc parameters, but maybe.

> > > This would improve the driver on various fronts:
> > > 
> > > - no need for playing tricks with use_mm/unuse_mm
> > > - possibly fewer global TLB flushes from switch_mm, which
> > >   may improve performance.
> > 
> > Why would there be less flushes?
> 
> I just read up on task->active_mm handling. There probably wouldn't
> be any. I got that wrong.
> 
> > > - based on that, the ability to use any kind of file
> > >   descriptor that can do writev/readv or sendmsg/recvmsg
> > >   without the nastiness you mentioned.
> > 
> > Yes, it's an interesting approach. As I said, need to tread very
> > carefully though, I don't think all issues are figured out. For example:
> > what happens if we pass our own fd here? Will refcount on file ever get
> > to 0 on exit?  There may be others ...
> 
> right.
> 
> > > The disadvantage of course is that you need to add a user
> > > thread for each guest device to make up for the workqueue
> > > that you save.
> > 
> > More importantly, you lose control of CPU locality.  Simply put, a
> > natural threading model in virtualization is one thread per guest vcpu.
> > Asking applications to add multiple helper threads just so they can
> > block forever is wrong, IMO, as userspace has no idea which CPU
> > they should be on, what priority to use, etc.
> 
> But the kernel also doesn't know this, you get the same problem in
> another way. If you have multiple guests running at different priorities,
> the kernel will use those priorities to do the more important transfers
> first, while with a global workqueue every guest gets the same priority.

We could create more threads if this becomes a problem. I just think it
should be transparent to userspace. Possibly it's useful to look at the
packet header as well to decide on priority: this is something userspace
can't do.

> You say that the natural model is to have one thread per guest
> CPU,

Sorry I was not clear. I think userspace should create thread per guest.
We can create as many as we need for networking but I think this should
be behind the scenes, so userspace shouldn't bother with host CPUs, it
will just get it wrong. Think of CPU hotplug, interrupts migrating
between CPUs, etc ...

> but you have a thread per host CPU instead. If the numbers
> are different, you probably lose either way.

The trick I used is to keep as much as possible local
TX done on the CPU that runs the guest,
RX done on the CPU that runs the NIC interrupt.
a smart SMP guest sees which cpu gets interrupts
from NIC and schedules RX there, and it shouldn't matter
if the numbers of CPUs are different.

> It gets worse if you try to apply NUMA policies.

I believe the best we could do is avoid switching CPUs
until we know the actual destination.

> > > > > to
> > > > > avoid some of the implications of kernel threads like the missing
> > > > > ability to handle transfer errors in user space.
> > > > 
> > > > Are you talking about TCP here?
> > > > Transfer errors are typically asynchronous - possibly eventfd
> > > > as I expose for vhost net is sufficient there.
> > > 
> > > I mean errors in general if we allow random file descriptors to be used.
> > > E.g. tun_chr_aio_read could return EBADFD, EINVAL, EFAULT, ERESTARTSYS,
> > > EIO, EAGAIN and possibly others. We can handle some in kernel, others
> > > should never happen with vhost_net, but if something unexpected happens
> > > it would be nice to just bail out to user space.
> > 
> > And note that there might be more than one error.  I guess, that's
> > another problem with trying to layer on top of vfs.
> 
> Why is that different from any other system call?

With other system calls nothing happens while you process the error.
Here, the guest (other queues) and the network keep running (unless
there is a thread per queue, maybe we can block a queue, but we both
agreed above we don't want that).

> We just return when
> we hit the first error condition.
> 
> 	Arnd <><

If you assume losing the code for the second error condition is OK, why
is the first one so important?  That's why I used a counter (eventfd)
per virtqueue, on error userspace can scan the ring and poll the socket
and discover what's wrong, and counter ensures we can detect that error
happened while we were not looking.
Arnd Bergmann Aug. 20, 2009, 1:10 p.m. UTC | #9
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
> > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > >
> > > > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > > > VHOST_SET_OWNER, VHOST_RESET_OWNER
> > > 
> > > SET/RESET OWNER is still needed: otherwise if you share a descriptor
> > > with another process, it can corrupt your memory.
> > 
> > How? The point of using user threads is that you only ever access the
> > address space of the thread that called the ioctl.
> 
> Think about this example with processes A and B sharing an fd:
> A does SET_USED_ADDRESS
> B does SET_USED_ADDRESS
> A does VHOST_NET_SPLICE
> See how stuff gets written into a random place in memory of A?

Yes, I didn't think of that. It doesn't seem like a big problem
though, because it's a clear misuse of the API (I guess your
current code returns an error for one of the SET_USED_ADDRESS
ioctls), so I would see it as a classic garbage-in garbage-out
case.

It may even work in the case that the sharing of the fd resulted
from a fork, where the address contains the same buffer in both
processes. I can't think of a reason why you would want to use
it like that though.

> > Why would I wake up the threads spuriously? Do you mean for
> > stopping the transmission or something else? I guess a pthread_kill
> > would be enough for shutting it down.
> 
> If you kill and restart them you lost priority etc parameters, but maybe.

If you want to restart it, just send a nonfatal signal (SIGUSR1,
SIGRTMIN, ...) instead of a SIGKILL.

> > > More importantly, you lose control of CPU locality.  Simply put, a
> > > natural threading model in virtualization is one thread per guest vcpu.
> > > Asking applications to add multiple helper threads just so they can
> > > block forever is wrong, IMO, as userspace has no idea which CPU
> > > they should be on, what priority to use, etc.
> > 
> > But the kernel also doesn't know this, you get the same problem in
> > another way. If you have multiple guests running at different priorities,
> > the kernel will use those priorities to do the more important transfers
> > first, while with a global workqueue every guest gets the same priority.
> 
> We could create more threads if this becomes a problem. I just think it
> should be transparent to userspace. Possibly it's useful to look at the
> packet header as well to decide on priority: this is something userspace
> can't do.

Being transparent to user space would be nice, I agree. Letting user space
choose would also be nice, e.g. if you want to distribute eight available
hardware queue pairs to three guests in a non-obvious way. The
implementation depends to some degree on how we want to do multiqueue
RX/TX in virtio-net in the long run. For best cache locality and NUMA
behaviour, we might want to have one virtqueue per guest CPU and control
them independently from the host.

Priorities of the packets are dealt with in the packet scheduler for
external interfaces, I guess that is sufficient. I'm not sure if we
need to honor the same priorities for guest-to-guest communication,
my feeling is that we don't need to.

> > You say that the natural model is to have one thread per guest
> > CPU,
> 
> Sorry I was not clear. I think userspace should create thread per guest.
> We can create as many as we need for networking but I think this should
> be behind the scenes, so userspace shouldn't bother with host CPUs, it
> will just get it wrong. Think of CPU hotplug, interrupts migrating
> between CPUs, etc ...

Yes, I hope we can avoid letting qemu know about host CPUs.
I'm not sure we can avoid it completely, because something needs
to set physical IRQ affinity and such for the virtual devices
if you want to get the best locality.

> > but you have a thread per host CPU instead. If the numbers
> > are different, you probably lose either way.
> 
> The trick I used is to keep as much as possible local
> TX done on the CPU that runs the guest,
> RX done on the CPU that runs the NIC interrupt.
> a smart SMP guest sees which cpu gets interrupts
> from NIC and schedules RX there, and it shouldn't matter
> if the numbers of CPUs are different.

yes, that sounds good.

> > It gets worse if you try to apply NUMA policies.
> 
> I believe the best we could do is avoid switching CPUs
> until we know the actual destination.

My point is that the RX data in the guest address space
should be on the same NUMA node that gets the interrupt.

> > > And note that there might be more than one error.  I guess, that's
> > > another problem with trying to layer on top of vfs.
> > 
> > Why is that different from any other system call?
> 
> With other system calls nothing happens while you process the error.
> Here, the guest (other queues) and the network keep running (unless
> there is a thread per queue, maybe we can block a queue, but we both
> agreed above we don't want that).

Well, I would expect error conditions to be fatal for the connections
normally, so blocking the queue is totally fine here IMHO. The ioctl
would never return while a guest is running and connected to a
working NIC.

> > We just return when
> > we hit the first error condition.
> 
> If you assume losing the code for the second error condition is OK, why
> is the first one so important?  That's why I used a counter (eventfd)
> per virtqueue, on error userspace can scan the ring and poll the socket
> and discover what's wrong, and counter ensures we can detect that error
> happened while we were not looking.

I guess we were talking about different kinds of errors here, and I'm
still not sure which one you are talking about.

	Arnd <><

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 20, 2009, 1:38 p.m. UTC | #10
On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
> On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> > On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
> > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > >
> > > > > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > > > > VHOST_SET_OWNER, VHOST_RESET_OWNER
> > > > 
> > > > SET/RESET OWNER is still needed: otherwise if you share a descriptor
> > > > with another process, it can corrupt your memory.
> > > 
> > > How? The point of using user threads is that you only ever access the
> > > address space of the thread that called the ioctl.
> > 
> > Think about this example with processes A and B sharing an fd:
> > A does SET_USED_ADDRESS
> > B does SET_USED_ADDRESS
> > A does VHOST_NET_SPLICE
> > See how stuff gets written into a random place in memory of A?
> 
> Yes, I didn't think of that. It doesn't seem like a big problem
> though, because it's a clear misuse of the API (I guess your
> current code returns an error for one of the SET_USED_ADDRESS
> ioctls), so I would see it as a classic garbage-in garbage-out
> case.
> 
> It may even work in the case that the sharing of the fd resulted
> from a fork, where the address contains the same buffer in both
> processes. I can't think of a reason why you would want to use
> it like that though.

It doesn't matter that I don't want this: allowing 1 process corrupt
another's memory is a security issue.  Once you get an fd, you want to
be able to use it without worrying that a bug in another process will
crash yours.

> > > Why would I wake up the threads spuriously? Do you mean for
> > > stopping the transmission or something else? I guess a pthread_kill
> > > would be enough for shutting it down.
> > 
> > If you kill and restart them you lost priority etc parameters, but maybe.
> 
> If you want to restart it, just send a nonfatal signal (SIGUSR1,
> SIGRTMIN, ...) instead of a SIGKILL.
> 
> > > > More importantly, you lose control of CPU locality.  Simply put, a
> > > > natural threading model in virtualization is one thread per guest vcpu.
> > > > Asking applications to add multiple helper threads just so they can
> > > > block forever is wrong, IMO, as userspace has no idea which CPU
> > > > they should be on, what priority to use, etc.
> > > 
> > > But the kernel also doesn't know this, you get the same problem in
> > > another way. If you have multiple guests running at different priorities,
> > > the kernel will use those priorities to do the more important transfers
> > > first, while with a global workqueue every guest gets the same priority.
> > 
> > We could create more threads if this becomes a problem. I just think it
> > should be transparent to userspace. Possibly it's useful to look at the
> > packet header as well to decide on priority: this is something userspace
> > can't do.
> 
> Being transparent to user space would be nice, I agree. Letting user space
> choose would also be nice, e.g. if you want to distribute eight available
> hardware queue pairs to three guests in a non-obvious way. The
> implementation depends to some degree on how we want to do multiqueue
> RX/TX in virtio-net in the long run. For best cache locality and NUMA
> behaviour, we might want to have one virtqueue per guest CPU and control
> them independently from the host.
> 
> Priorities of the packets are dealt with in the packet scheduler for
> external interfaces, I guess that is sufficient. I'm not sure if we
> need to honor the same priorities for guest-to-guest communication,
> my feeling is that we don't need to.
> 
> > > You say that the natural model is to have one thread per guest
> > > CPU,
> > 
> > Sorry I was not clear. I think userspace should create thread per guest.
> > We can create as many as we need for networking but I think this should
> > be behind the scenes, so userspace shouldn't bother with host CPUs, it
> > will just get it wrong. Think of CPU hotplug, interrupts migrating
> > between CPUs, etc ...
> 
> Yes, I hope we can avoid letting qemu know about host CPUs.

So ... if we need per-host-CPU threads, we'll have to create them
ourselves then.

> I'm not sure we can avoid it completely, because something needs
> to set physical IRQ affinity and such for the virtual devices
> if you want to get the best locality.
> 
> > > but you have a thread per host CPU instead. If the numbers
> > > are different, you probably lose either way.
> > 
> > The trick I used is to keep as much as possible local
> > TX done on the CPU that runs the guest,
> > RX done on the CPU that runs the NIC interrupt.
> > a smart SMP guest sees which cpu gets interrupts
> > from NIC and schedules RX there, and it shouldn't matter
> > if the numbers of CPUs are different.
> 
> yes, that sounds good.
> 
> > > It gets worse if you try to apply NUMA policies.
> > 
> > I believe the best we could do is avoid switching CPUs
> > until we know the actual destination.
> 
> My point is that the RX data in the guest address space
> should be on the same NUMA node that gets the interrupt.

Exactly. And since I know this, I just do the right thing
instead of expecting userspace to do this with taskset and
stuff, which no one seems to get right anyway.

> > > > And note that there might be more than one error.  I guess, that's
> > > > another problem with trying to layer on top of vfs.
> > > 
> > > Why is that different from any other system call?
> > 
> > With other system calls nothing happens while you process the error.
> > Here, the guest (other queues) and the network keep running (unless
> > there is a thread per queue, maybe we can block a queue, but we both
> > agreed above we don't want that).
> 
> Well, I would expect error conditions to be fatal for the connections
> normally,

Not necessarily. E.g. one can imagine that userspace wants to handle
access to specific address ranges (framebuffer?) in slow path, and let
kernel handle the rest. A natural way is to make this range generate
access error.

> so blocking the queue is totally fine here IMHO. The ioctl
> would never return while a guest is running and connected to a
> working NIC.
> 
> > > We just return when
> > > we hit the first error condition.
> > 
> > If you assume losing the code for the second error condition is OK, why
> > is the first one so important?  That's why I used a counter (eventfd)
> > per virtqueue, on error userspace can scan the ring and poll the socket
> > and discover what's wrong, and counter ensures we can detect that error
> > happened while we were not looking.
> 
> I guess we were talking about different kinds of errors here, and I'm
> still not sure which one you are talking about.
> 
> 	Arnd <><

Non fatal errors. E.g. translation errors probably should be
non-fatal. I can also imagine working around guest bugs in
userspace.
Arnd Bergmann Aug. 20, 2009, 2:31 p.m. UTC | #11
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
> > On Thursday 20 August 2009, Michael S. Tsirkin wrote:
>
> It doesn't matter that I don't want this: allowing 1 process corrupt
> another's memory is a security issue.  Once you get an fd, you want to
> be able to use it without worrying that a bug in another process will
> crash yours.

Ok, got it. Yes, that would be inacceptable.

> > > If you assume losing the code for the second error condition is OK, why
> > > is the first one so important?  That's why I used a counter (eventfd)
> > > per virtqueue, on error userspace can scan the ring and poll the socket
> > > and discover what's wrong, and counter ensures we can detect that error
> > > happened while we were not looking.
> > 
> > I guess we were talking about different kinds of errors here, and I'm
> > still not sure which one you are talking about.
> > 
> Non fatal errors. E.g. translation errors probably should be
> non-fatal. I can also imagine working around guest bugs in
> userspace.

Ah, so I guess the confusion was that I was worried about
errors coming from the socket file descriptor, while you
were thinking of errors from the guest side, which I did not
expect to happen.

The errors from the socket (or chardev, as that was the
start of the argument) should still fit into the categories
that I mentioned, either they can be handled by the host
kernel, or they are fatal.

I'll read up in your code to see how you handle asynchronous
non-fatal errors from the guest. Intuitively, I'd still
assume that returning the first error should be enough
because it will typically mean that you cannot continue
without fixing it up first, and you might get the next
error immediately after that.

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 20, 2009, 2:42 p.m. UTC | #12
On Thu, Aug 20, 2009 at 04:31:36PM +0200, Arnd Bergmann wrote:
> On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> > On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
> > > On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> >
> > It doesn't matter that I don't want this: allowing 1 process corrupt
> > another's memory is a security issue.  Once you get an fd, you want to
> > be able to use it without worrying that a bug in another process will
> > crash yours.
> 
> Ok, got it. Yes, that would be inacceptable.
> 
> > > > If you assume losing the code for the second error condition is OK, why
> > > > is the first one so important?  That's why I used a counter (eventfd)
> > > > per virtqueue, on error userspace can scan the ring and poll the socket
> > > > and discover what's wrong, and counter ensures we can detect that error
> > > > happened while we were not looking.
> > > 
> > > I guess we were talking about different kinds of errors here, and I'm
> > > still not sure which one you are talking about.
> > > 
> > Non fatal errors. E.g. translation errors probably should be
> > non-fatal. I can also imagine working around guest bugs in
> > userspace.
> 
> Ah, so I guess the confusion was that I was worried about
> errors coming from the socket file descriptor, while you
> were thinking of errors from the guest side, which I did not
> expect to happen.
> 
> The errors from the socket (or chardev, as that was the
> start of the argument) should still fit into the categories
> that I mentioned, either they can be handled by the host
> kernel, or they are fatal.

Hmm, are you sure? Imagine a device going away while socket is bound to
it.  You get -ENXIO. It's not fatal in a sense that you can bind the
socket to another device and go on, right?

> I'll read up in your code to see how you handle asynchronous
> non-fatal errors from the guest. Intuitively, I'd still
> assume that returning the first error should be enough
> because it will typically mean that you cannot continue
> without fixing it up first, and you might get the next
> error immediately after that.
> 
> 	Arnd <><

Yes, but need to be careful not to lose that next error, and these
errors only block one queue. I handle this by reporting them on an
eventfd, per vq, which userspace can poll.
Arnd Bergmann Aug. 20, 2009, 3:10 p.m. UTC | #13
On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> 
> > The errors from the socket (or chardev, as that was the
> > start of the argument) should still fit into the categories
> > that I mentioned, either they can be handled by the host
> > kernel, or they are fatal.
> 
> Hmm, are you sure? Imagine a device going away while socket is bound to
> it.  You get -ENXIO. It's not fatal in a sense that you can bind the
> socket to another device and go on, right?

Right. Not fatal in that sense, but fatal in the sense that I
can no longer transmit other frames until you recover. I think
we both meant the same here.

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gleb Natapov Aug. 21, 2009, 1:20 p.m. UTC | #14
On Thu, Aug 20, 2009 at 04:38:17PM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
> > On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> > > On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
> > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > > On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > > > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > > >
> > > > > > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > > > > > VHOST_SET_OWNER, VHOST_RESET_OWNER
> > > > > 
> > > > > SET/RESET OWNER is still needed: otherwise if you share a descriptor
> > > > > with another process, it can corrupt your memory.
> > > > 
> > > > How? The point of using user threads is that you only ever access the
> > > > address space of the thread that called the ioctl.
> > > 
> > > Think about this example with processes A and B sharing an fd:
> > > A does SET_USED_ADDRESS
> > > B does SET_USED_ADDRESS
> > > A does VHOST_NET_SPLICE
> > > See how stuff gets written into a random place in memory of A?
> > 
> > Yes, I didn't think of that. It doesn't seem like a big problem
> > though, because it's a clear misuse of the API (I guess your
> > current code returns an error for one of the SET_USED_ADDRESS
> > ioctls), so I would see it as a classic garbage-in garbage-out
> > case.
> > 
> > It may even work in the case that the sharing of the fd resulted
> > from a fork, where the address contains the same buffer in both
> > processes. I can't think of a reason why you would want to use
> > it like that though.
> 
> It doesn't matter that I don't want this: allowing 1 process corrupt
> another's memory is a security issue.  Once you get an fd, you want to
> be able to use it without worrying that a bug in another process will
> crash yours.
> 
If B's SET_USED_ADDRESS fails how one process can corrupt a memory of
other process?

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index ebc2691..eb0c1da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6312,6 +6312,16 @@  S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
 
+VIRTIO HOST (VHOST)
+P:	Michael S. Tsirkin
+M:	mst@redhat.com
+L:	kvm@vger.kernel.org
+L:	virtualization@lists.osdl.org
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	drivers/vhost/
+F:	include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 P:	Roger Luethi
 M:	rl@hellgate.ch
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@  config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@  obj-$(CONFIG_HID)		+= hid/
 obj-$(CONFIG_PPC_PS3)		+= ps3/
 obj-$(CONFIG_OF)		+= of/
 obj-$(CONFIG_SSB)		+= ssb/
+obj-$(CONFIG_VHOST_NET)		+= vhost/
 obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_STAGING)		+= staging/
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 0000000..d955406
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@ 
+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest networking with virtio_net. Not to be confused with virtio_net
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_net.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
new file mode 100644
index 0000000..72dd020
--- /dev/null
+++ b/drivers/vhost/Makefile
@@ -0,0 +1,2 @@ 
+obj-$(CONFIG_VHOST_NET) += vhost_net.o
+vhost_net-y := vhost.o net.o
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
new file mode 100644
index 0000000..728094b
--- /dev/null
+++ b/drivers/vhost/net.c
@@ -0,0 +1,429 @@ 
+/* Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * virtio-net server in host kernel.
+ */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include <linux/net.h>
+#include <linux/if_packet.h>
+#include <linux/if_arp.h>
+
+#include <net/sock.h>
+
+#include <asm/mmu_context.h>
+
+#include "vhost.h"
+
+enum {
+	VHOST_NET_VQ_RX = 0,
+	VHOST_NET_VQ_TX = 1,
+	VHOST_NET_VQ_MAX = 2,
+};
+
+struct vhost_net {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	/* We use a kind of RCU to access sock pointer.
+	 * All readers access it from workqueue, which makes it possible to
+	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
+	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
+	 * work item execution acts instead of rcu_read_lock() and the end of
+	 * work item execution acts instead of rcu_read_lock().
+	 * Writers use device mutex. */
+	struct socket *sock;
+	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+};
+
+/* Expects to be always run from workqueue - which acts as
+ * read-size critical section for our kind of RCU. */
+static void handle_tx(struct vhost_net *net)
+{
+	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	unsigned head, out, in;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_iov = (struct iovec *)vq->iov + 1,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t len;
+	int err;
+	struct socket *sock = rcu_dereference(net->sock);
+	if (!sock || !sock_writeable(sock->sk))
+		return;
+
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	for (;;) {
+		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
+		/* Nothing new?  Wait for eventfd to tell us they refilled. */
+		if (head == vq->num)
+			break;
+		if (out <= 1 || in) {
+			vq_err(vq, "Unexpected descriptor format for TX: "
+			       "out %d, int %d\n", out, in);
+			break;
+		}
+		/* Sanity check */
+		if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
+			vq_err(vq, "Unexpected header len for TX: "
+			       "%ld expected %zd\n", vq->iov->iov_len,
+			       sizeof(struct virtio_net_hdr));
+			break;
+		}
+		/* Skip header. TODO: support TSO. */
+		msg.msg_iovlen = out - 1;
+		len = iov_length(vq->iov + 1, out - 1);
+		/* TODO: Check specific error and bomb out unless ENOBUFS? */
+		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		if (err < 0) {
+			vhost_discard_vq_desc(vq);
+			break;
+		}
+		if (err != len)
+			pr_err("Truncated TX packet: "
+			       " len %d != %zd\n", err, len);
+		vhost_add_used_and_trigger(vq, head,
+				     len + sizeof(struct virtio_net_hdr));
+	}
+
+	mutex_unlock(&vq->mutex);
+	unuse_mm(net->dev.mm);
+}
+
+/* Expects to be always run from workqueue - which acts as
+ * read-size critical section for our kind of RCU. */
+static void handle_rx(struct vhost_net *net)
+{
+	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+	unsigned head, out, in;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
+		.msg_controllen = 0,
+		.msg_iov = vq->iov + 1,
+		.msg_flags = MSG_DONTWAIT,
+	};
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	size_t len;
+	int err;
+	struct socket *sock = rcu_dereference(net->sock);
+	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+		return;
+
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+
+	for (;;) {
+		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
+		if (head == vq->num)
+			break;
+		if (in <= 1 || out) {
+			vq_err(vq, "Unexpected descriptor format for RX: out %d, int %d\n",
+			       out, in);
+			break;
+		}
+		/* Sanity check */
+		if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
+			vq_err(vq, "Unexpected header len for RX: %ld expected %zd\n",
+			       vq->iov->iov_len, sizeof(struct virtio_net_hdr));
+			break;
+		}
+		/* Skip header. TODO: support TSO/mergeable rx buffers. */
+		msg.msg_iovlen = in - 1;
+		len = iov_length(vq->iov + 1, in - 1);
+		err = sock->ops->recvmsg(NULL, sock, &msg,
+					 len, MSG_DONTWAIT | MSG_TRUNC);
+		/* TODO: Check specific error and bomb out unless EAGAIN? */
+		if (err < 0) {
+			vhost_discard_vq_desc(vq);
+			break;
+		}
+		/* TODO: Should check and handle checksum. */
+		if (err > len) {
+			pr_err("Discarded truncated rx packet: "
+			       " len %d > %zd\n", err, len);
+			vhost_discard_vq_desc(vq);
+			continue;
+		}
+		len = err;
+		err = copy_to_user(vq->iov->iov_base, &hdr, sizeof hdr);
+		if (err) {
+			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
+			       vq->iov->iov_base, err);
+			break;
+		}
+		vhost_add_used_and_trigger(vq, head, len + sizeof hdr);
+	}
+
+	mutex_unlock(&vq->mutex);
+	unuse_mm(net->dev.mm);
+}
+
+static void handle_tx_kick(struct work_struct *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_net *net;
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	net = container_of(vq->dev, struct vhost_net, dev);
+	handle_tx(net);
+}
+
+static void handle_rx_kick(struct work_struct *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_net *net;
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	net = container_of(vq->dev, struct vhost_net, dev);
+	handle_rx(net);
+}
+
+static void handle_tx_net(struct work_struct *work)
+{
+	struct vhost_net *net;
+	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
+	handle_tx(net);
+}
+
+static void handle_rx_net(struct work_struct *work)
+{
+	struct vhost_net *net;
+	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
+	handle_rx(net);
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int r;
+	if (!n)
+		return -ENOMEM;
+	f->private_data = n;
+	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
+	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+	if (r < 0) {
+		kfree(n);
+		return r;
+	}
+
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	return 0;
+}
+
+static struct socket *vhost_net_stop(struct vhost_net *n)
+{
+	struct socket *sock = n->sock;
+	rcu_assign_pointer(n->sock, NULL);
+	if (sock) {
+		vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
+		vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
+	}
+	return sock;
+}
+
+static int vhost_net_release(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = f->private_data;
+	struct socket *sock;
+
+	sock = vhost_net_stop(n);
+	vhost_dev_cleanup(&n->dev);
+	if (sock)
+		fput(sock->file);
+	kfree(n);
+	return 0;
+}
+
+static long vhost_net_set_socket(struct vhost_net *n, int fd)
+{
+	struct {
+		struct sockaddr_ll sa;
+		char  buf[MAX_ADDR_LEN];
+	} uaddr;
+	struct socket *sock, *oldsock = NULL;
+	int uaddr_len = sizeof uaddr, r;
+
+	mutex_lock(&n->dev.mutex);
+	r = vhost_dev_check_owner(&n->dev);
+	if (r)
+		goto done;
+
+	if (fd == -1) {
+		/* Disconnect from socket and device. */
+		oldsock = vhost_net_stop(n);
+		goto done;
+	}
+	
+	sock = sockfd_lookup(fd, &r);
+	if (!sock) {
+		r = -ENOTSOCK;
+		goto done;
+	}
+
+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto done;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto done;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto done;
+	}
+
+	/* start polling new socket */
+	if (sock == oldsock)
+		goto done;
+
+	if (oldsock) {
+		vhost_poll_stop(n->poll + VHOST_NET_VQ_TX);
+		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+	}
+	oldsock = n->sock;
+	rcu_assign_pointer(n->sock, sock);
+	vhost_poll_start(n->poll + VHOST_NET_VQ_TX, sock->file);
+	vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+done:
+	mutex_unlock(&n->dev.mutex);
+	if (oldsock) {
+		vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
+		vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_TX].poll);
+		vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_RX].poll);
+		fput(oldsock->file);
+	}
+	return r;
+}
+
+static long vhost_net_reset_owner(struct vhost_net *n)
+{
+	struct socket *sock = NULL;
+	long r;
+	mutex_lock(&n->dev.mutex);
+	r = vhost_dev_check_owner(&n->dev);
+	if (r)
+		goto done;
+	sock = vhost_net_stop(n);
+	r = vhost_dev_reset_owner(&n->dev);
+done:
+	mutex_unlock(&n->dev.mutex);
+	if (sock)
+		fput(sock->file);
+	return r;
+}
+
+static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+{
+	struct vhost_net *n = f->private_data;
+	void __user *argp = (void __user *)arg;
+	u32 __user *featurep = argp;
+	int __user *fdp = argp;
+	u32 features;
+	int fd, r;
+	switch (ioctl) {
+	case VHOST_NET_SET_SOCKET:
+		r = get_user(fd, fdp);
+		if (r < 0)
+			return r;
+		return vhost_net_set_socket(n, fd);
+	case VHOST_GET_FEATURES:
+		/* No features for now */
+		features = 0;
+		return put_user(features, featurep);
+	case VHOST_ACK_FEATURES:
+		r = get_user(features, featurep);
+		/* No features for now */
+		if (r < 0)
+			return r;
+		if (features)
+			return -EOPNOTSUPP;
+		return 0;
+	case VHOST_RESET_OWNER:
+		return vhost_net_reset_owner(n);
+	default:
+		return vhost_dev_ioctl(&n->dev, ioctl, arg);
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
+				   unsigned long arg)
+{
+	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static struct file_operations vhost_net_fops = {
+	.owner          = THIS_MODULE,
+	.release        = vhost_net_release,
+	.unlocked_ioctl = vhost_net_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl   = vhost_net_compat_ioctl,
+#endif
+	.open           = vhost_net_open,
+};
+
+static struct miscdevice vhost_net_misc = {
+	VHOST_NET_MINOR,
+	"vhost-net",
+	&vhost_net_fops,
+};
+
+int vhost_net_init(void)
+{
+	int r = vhost_init();
+	if (r)
+		goto err_init;
+	r = misc_register(&vhost_net_misc);
+	if (r)
+		goto err_reg;
+	return 0;
+err_reg:
+	vhost_cleanup();
+err_init:
+	return r;
+
+}
+module_init(vhost_net_init);
+
+void vhost_net_exit(void)
+{
+	misc_deregister(&vhost_net_misc);
+	vhost_cleanup();
+}
+module_exit(vhost_net_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Michael S. Tsirkin");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
new file mode 100644
index 0000000..a52b18b
--- /dev/null
+++ b/drivers/vhost/vhost.c
@@ -0,0 +1,663 @@ 
+/* Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *
+ * Inspiration, some code, and most witty comments come from
+ * Documentation/lguest/lguest.c, by Rusty Russell
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Generic code for virtio server in host kernel.
+ */
+
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/mm.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+
+#include <linux/net.h>
+#include <linux/if_packet.h>
+#include <linux/if_arp.h>
+
+#include <net/sock.h>
+
+#include <asm/mmu_context.h>
+
+#include "vhost.h"
+
+enum {
+	VHOST_MEMORY_MAX_NREGIONS = 64,
+};
+
+static struct workqueue_struct *vhost_workqueue;
+
+static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+			    poll_table *pt)
+{
+	struct vhost_poll *poll;
+	poll = container_of(pt, struct vhost_poll, table);
+
+	poll->wqh = wqh;
+	add_wait_queue(wqh, &poll->wait);
+}
+
+static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct vhost_poll *poll;
+	poll = container_of(wait, struct vhost_poll, wait);
+	if (!((unsigned long)key & poll->mask))
+		return 0;
+
+	queue_work(vhost_workqueue, &poll->work);
+	return 0;
+}
+
+/* Init poll structure */
+void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
+		     unsigned long mask)
+{
+	INIT_WORK(&poll->work, func);
+	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
+	init_poll_funcptr(&poll->table, vhost_poll_func);
+	poll->mask = mask;
+}
+
+/* Start polling a file. We add ourselves to file's wait queue. The user must
+ * keep a reference to a file until after vhost_poll_stop is called. */
+void vhost_poll_start(struct vhost_poll *poll, struct file *file)
+{
+	unsigned long mask;
+	mask = file->f_op->poll(file, &poll->table);
+	if (mask)
+		vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);
+}
+
+/* Stop polling a file. After this function returns, it becomes safe to drop the
+ * file reference. You must also flush afterwards. */
+void vhost_poll_stop(struct vhost_poll *poll)
+{
+	remove_wait_queue(poll->wqh, &poll->wait);
+}
+
+/* Flush any work that has been scheduled. When calling this, don't hold any
+ * locks that are also used by the callback. */
+void vhost_poll_flush(struct vhost_poll *poll)
+{
+	flush_work(&poll->work);
+}
+
+long vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue *vqs, int nvqs)
+{
+	int i;
+	dev->vqs = vqs;
+	dev->nvqs = nvqs;
+	mutex_init(&dev->mutex);
+
+	for(i = 0; i < dev->nvqs; ++i) {
+		dev->vqs[i].dev = dev;
+		mutex_init(&dev->vqs[i].mutex);
+		if (dev->vqs[i].handle_kick)
+			vhost_poll_init(&dev->vqs[i].poll,
+					dev->vqs[i].handle_kick,
+					POLLIN);
+	}
+	return 0;
+}
+
+/* User should have device mutex */
+long vhost_dev_check_owner(struct vhost_dev *dev)
+{
+	return dev->mm == current->mm ? 0 : -EPERM;
+}
+
+/* User should have device mutex */
+static long vhost_dev_set_owner(struct vhost_dev *dev)
+{
+	if (dev->mm)
+		return -EBUSY;
+	dev->mm = get_task_mm(current);
+	return 0;
+}
+
+/* User should have device mutex */
+long vhost_dev_reset_owner(struct vhost_dev *dev)
+{
+	struct vhost_memory *memory;
+
+	/* Restore memory to default 1:1 mapping. */
+	memory = kmalloc(offsetof(struct vhost_memory, regions) +
+			 2 * sizeof *memory->regions, GFP_KERNEL);
+	if (!memory)
+		return -ENOMEM;
+
+	vhost_dev_cleanup(dev);
+
+	memory->nregions = 2;
+	memory->regions[0].guest_phys_addr = 1;
+	memory->regions[0].userspace_addr = 1;
+	memory->regions[0].memory_size = ~0ULL;
+	memory->regions[1].guest_phys_addr = 0;
+	memory->regions[1].userspace_addr = 0;
+	memory->regions[1].memory_size = 1;
+	dev->memory = memory;
+	return 0;
+}
+
+/* User should have device mutex */
+void vhost_dev_cleanup(struct vhost_dev *dev)
+{
+	int i;
+	for(i = 0; i < dev->nvqs; ++i) {
+		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
+			vhost_poll_stop(&dev->vqs[i].poll);
+			vhost_poll_flush(&dev->vqs[i].poll);
+		}
+		if (dev->vqs[i].error_ctx)
+			eventfd_ctx_put(dev->vqs[i].error_ctx);
+		if (dev->vqs[i].error)
+			fput(dev->vqs[i].error);
+		if (dev->vqs[i].kick)
+			fput(dev->vqs[i].kick);
+		if (dev->vqs[i].call_ctx)
+			eventfd_ctx_put(dev->vqs[i].call_ctx);
+		if (dev->vqs[i].call)
+			fput(dev->vqs[i].call);
+		dev->vqs[i].error_ctx = NULL;
+		dev->vqs[i].error = NULL;
+		dev->vqs[i].kick = NULL;
+		dev->vqs[i].call_ctx = NULL;
+		dev->vqs[i].call = NULL;
+	}
+	/* No one will access memory at this point */
+	kfree(dev->memory);
+	dev->memory = NULL;
+	if (dev->mm)
+		mmput(dev->mm);
+	dev->mm = NULL;
+}
+
+static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
+{
+	struct vhost_memory mem, *newmem, *oldmem;
+	unsigned long size = offsetof(struct vhost_memory, regions);
+	long r;
+	r = copy_from_user(&mem, m, size);
+	if (r)
+		return r;
+	if (mem.padding)
+		return -EOPNOTSUPP;
+	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
+		return -E2BIG;
+	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+	if (!newmem)
+		return -ENOMEM;
+
+	memcpy(newmem, &mem, size);
+	r = copy_from_user(newmem->regions, m->regions,
+			   mem.nregions * sizeof *m->regions);
+	if (r) {
+		kfree(newmem);
+		return r;
+	}
+	oldmem = d->memory;
+	rcu_assign_pointer(d->memory, newmem);
+	synchronize_rcu();
+	kfree(oldmem);
+	return 0;
+}
+
+static int init_used(struct vhost_virtqueue *vq)
+{
+	u16 flags = 0;
+	int r = put_user(flags, &vq->used->flags);
+	if (r)
+		return r;
+	return get_user(vq->last_used_idx, &vq->used->idx);
+}
+
+static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
+{
+	struct file *eventfp, *filep = NULL, *pollstart = NULL, *pollstop = NULL;
+	struct eventfd_ctx *ctx = NULL;
+	u32 __user *idxp = argp;
+	struct vhost_virtqueue *vq;
+	struct vhost_vring_state s;
+	struct vhost_vring_file f;
+	struct vhost_vring_addr a;
+	u32 idx;
+	long r;
+
+	r = get_user(idx, idxp);
+	if (r < 0)
+		return r;
+	if (idx > d->nvqs)
+		return -ENOBUFS;
+
+	vq = d->vqs + idx;
+
+	mutex_lock(&vq->mutex);
+
+	switch (ioctl) {
+	case VHOST_SET_VRING_NUM:
+		r = copy_from_user(&s, argp, sizeof s);
+		if (r < 0)
+			break;
+		if (s.num > 0xffff) {
+			r = -EINVAL;
+			break;
+		}
+		vq->num = s.num;
+		break;
+	case VHOST_SET_VRING_BASE:
+		r = copy_from_user(&s, argp, sizeof s);
+		if (r < 0)
+			break;
+		if (s.num > 0xffff) {
+			r = -EINVAL;
+			break;
+		}
+		vq->last_avail_idx = s.num;
+		break;
+	case VHOST_GET_VRING_BASE:
+		s.index = idx;
+		s.num = vq->last_avail_idx;
+		r = copy_to_user(argp, &s, sizeof s);
+		break;
+	case VHOST_SET_VRING_DESC:
+		r = copy_from_user(&a, argp, sizeof a);
+		if (r < 0)
+			break;
+		if (a.padding) {
+			r = -EOPNOTSUPP;
+			break;
+		}
+		if ((u64)(long)a.user_addr != a.user_addr) {
+			r = -EFAULT;
+			break;
+		}
+		vq->desc = (void __user *)(long)a.user_addr;
+		break;
+	case VHOST_SET_VRING_AVAIL:
+		r = copy_from_user(&a, argp, sizeof a);
+		if (r < 0)
+			break;
+		if (a.padding) {
+			r = -EOPNOTSUPP;
+			break;
+		}
+		if ((u64)(long)a.user_addr != a.user_addr) {
+			r = -EFAULT;
+			break;
+		}
+		vq->avail = (void __user *)(long)a.user_addr;
+		break;
+	case VHOST_SET_VRING_USED:
+		r = copy_from_user(&a, argp, sizeof a);
+		if (r < 0)
+			break;
+		if (a.padding) {
+			r = -EOPNOTSUPP;
+			break;
+		}
+		if ((u64)(long)a.user_addr != a.user_addr) {
+			r = -EFAULT;
+			break;
+		}
+		vq->used = (void __user *)(long)a.user_addr;
+		r = init_used(vq);
+		if (r)
+			break;
+		break;
+	case VHOST_SET_VRING_KICK:
+		r = copy_from_user(&f, argp, sizeof f);
+		if (r < 0)
+			break;
+		eventfp = f.fd == -1 ? NULL: eventfd_fget(f.fd);
+		if (IS_ERR(eventfp))
+			return PTR_ERR(eventfp);
+		if (eventfp != vq->kick) {
+			pollstop = filep = vq->kick;
+			pollstart = vq->kick = eventfp;
+		} else
+			filep = eventfp;
+		break;
+	case VHOST_SET_VRING_CALL:
+		r = copy_from_user(&f, argp, sizeof f);
+		if (r < 0)
+			break;
+		eventfp = f.fd == -1 ? NULL: eventfd_fget(f.fd);
+		if (IS_ERR(eventfp))
+			return PTR_ERR(eventfp);
+		if (eventfp != vq->call) {
+			filep = vq->call;
+			ctx = vq->call_ctx;
+			vq->call = eventfp;
+			vq->call_ctx = eventfp ?
+				eventfd_ctx_fileget(eventfp) : NULL;
+		} else
+			filep = eventfp;
+		break;
+	case VHOST_SET_VRING_ERR:
+		r = copy_from_user(&f, argp, sizeof f);
+		if (r < 0)
+			break;
+		eventfp = f.fd == -1 ? NULL: eventfd_fget(f.fd);
+		if (IS_ERR(eventfp))
+			return PTR_ERR(eventfp);
+		if (eventfp != vq->error) {
+			filep = vq->error;
+			vq->error = eventfp;
+			ctx = vq->error_ctx;
+			vq->error_ctx = eventfp ?
+				eventfd_ctx_fileget(eventfp) : NULL;
+		} else
+			filep = eventfp;
+		break;
+	default:
+		r = -ENOIOCTLCMD;
+	}
+
+	if (pollstop && vq->handle_kick)
+		vhost_poll_stop(&vq->poll);
+
+	if (ctx)
+		eventfd_ctx_put(ctx);
+	if (filep)
+		fput(filep);
+
+	if (pollstart && vq->handle_kick)
+		vhost_poll_start(&vq->poll, vq->kick);
+
+	mutex_unlock(&vq->mutex);
+
+	if (pollstop && vq->handle_kick)
+		vhost_poll_flush(&vq->poll);
+	return 0;
+}
+
+long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	long r;
+
+	mutex_lock(&d->mutex);
+	if (ioctl == VHOST_SET_OWNER) {
+		r = vhost_dev_set_owner(d);
+		goto done;
+	}
+
+	r = vhost_dev_check_owner(d);
+	if (r)
+		goto done;
+		
+	switch (ioctl) {
+	case VHOST_SET_MEM_TABLE:
+		r = vhost_set_memory(d, argp);
+		break;
+	default:
+		r = vhost_set_vring(d, ioctl, argp);
+		break;
+	}
+done:
+	mutex_unlock(&d->mutex);
+	return r;
+}
+
+static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
+						     __u64 addr, __u32 len)
+{
+	struct vhost_memory_region *reg;
+	int i;
+	/* linear search is not brilliant, but we really have on the order of 6
+	 * regions in practice */
+	for (i = 0; i < mem->nregions; ++i) {
+		reg = mem->regions + i;
+		if (reg->guest_phys_addr <= addr &&
+		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
+			return reg;
+	}
+	return NULL;
+}
+
+/* FIXME: this does not handle a region that spans multiple
+ * address/len pairs */
+int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
+		   struct iovec iov[], int iov_count, int iov_size,
+		   unsigned *num)
+{
+	const struct vhost_memory_region *reg;
+	struct vhost_memory *mem;
+	struct iovec *_iov;
+	u64 s = 0;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	mem = rcu_dereference(dev->memory);
+	while ((u64)len > s) {
+		u64 size;
+		if (*num + iov_count >= iov_size) {
+			ret = -ENOBUFS;
+			break;
+		}
+		reg = find_region(mem, addr, len);
+		if (!reg) {
+			ret = -EFAULT;
+			break;
+		}
+		_iov = iov + iov_count + *num;
+		size = reg->memory_size - addr + reg->guest_phys_addr;
+		_iov->iov_len = min((u64)len, size);
+		_iov->iov_base = (void *)
+			(reg->userspace_addr + addr - reg->guest_phys_addr);
+		s += size;
+		addr += size;
+		++*num;
+	}
+	
+	rcu_read_unlock();
+	return ret;
+}
+
+/* Each buffer in the virtqueues is actually a chain of descriptors.  This
+ * function returns the next descriptor in the chain, or vq->vring.num if we're
+ * at the end. */
+static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
+{
+	unsigned int next;
+
+	/* If this descriptor says it doesn't chain, we're done. */
+	if (!(desc->flags & VRING_DESC_F_NEXT))
+		return vq->num;
+
+	/* Check they're not leading us off end of descriptors. */
+	next = desc->next;
+	/* Make sure compiler knows to grab that: we don't want it changing! */
+	/* We will use the result as an index in an array, so most
+	 * architectures only need a compiler barrier here. */
+	read_barrier_depends();
+
+	if (next >= vq->num) {
+		vq_err(vq, "Desc next is %u > %u", next, vq->num);
+		return vq->num;
+	}
+
+	return next;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[],
+			   unsigned int *out_num, unsigned int *in_num)
+{
+	struct vring_desc desc;
+	unsigned int i, head;
+	u16 last_avail_idx, idx;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vq->last_avail_idx;
+	if (get_user(idx, &vq->avail->idx)) {
+		vq_err(vq, "Failed to access avail idx at %p\n",
+		       &vq->avail->idx);
+		return vq->num;
+	}
+	
+	if ((u16)(idx - last_avail_idx) > vq->num) {
+		vq_err(vq, "Guest moved used index from %u to %u",
+		       last_avail_idx, idx);
+		return vq->num;
+	}
+
+	/* If there's nothing new since last we looked, return invalid. */
+	if (idx == last_avail_idx)
+		return vq->num;
+
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+		vq_err(vq, "Failed to read head: idx %d address %p\n",
+		       idx, &vq->avail->ring[last_avail_idx % vq->num]);
+		return vq->num;
+	}
+
+	/* If their number is silly, that's a fatal mistake. */
+	if (head >= vq->num) {
+		vq_err(vq, "Guest says index %u > %u is available",
+		       head, vq->num);
+		return vq->num;
+	}
+
+	vq->last_avail_idx++;
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+
+	i = head;
+	do {
+		unsigned *num;
+		unsigned iov_count;
+		if (copy_from_user(&desc, vq->desc + i, sizeof desc)) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+			       i, vq->desc + i);
+			return vq->num;
+		}
+		/* If this is an input descriptor, increment that count. */
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			num = in_num;
+			iov_count = *out_num;
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (*in_num) {
+				vq_err(vq, "Descriptor has out after in: "
+				       "idx %d\n", i);
+				return vq->num;
+			}
+			num = out_num;
+			iov_count = *in_num;
+		}
+		if (translate_desc(dev, desc.addr, desc.len, iov, iov_count,
+				   VHOST_NET_MAX_SG, num)) {
+			vq_err(vq, "Failed to translate descriptor: idx %d\n",
+			       i);
+			return vq->num;
+		}
+
+		/* If we've got too many, that implies a descriptor loop. */
+		if (*out_num + *in_num > vq->num) {
+			vq_err(vq, "Looped descriptor: idx %d\n", i);
+			return vq->num;
+		}
+	} while ((i = next_desc(vq, &desc)) != vq->num);
+
+	vq->inflight++;
+	return head;
+}
+
+/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
+void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
+{
+	vq->last_avail_idx--;
+	vq->inflight--;
+}
+
+/* After we've used one of their buffers, we tell them about it.  We'll then
+ * want to send them an interrupt, using vq->call. */
+int vhost_add_used(struct vhost_virtqueue *vq,
+			  unsigned int head, int len)
+{
+	struct vring_used_elem *used;
+
+	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
+	 * next entry in that used ring. */
+	used = &vq->used->ring[vq->last_used_idx % vq->num];
+	if (put_user(head, &used->id)) {
+		vq_err(vq, "Failed to write used id");
+		return -EFAULT;
+	}
+	if (put_user(len, &used->len)) {
+		vq_err(vq, "Failed to write used len");
+		return -EFAULT;
+	}
+	/* Make sure buffer is written before we update index. */
+	wmb();
+	if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
+		vq_err(vq, "Failed to increment used idx");
+		return -EFAULT;
+	}
+	vq->last_used_idx++;
+	vq->inflight--;
+	return 0;
+}
+
+/* This actually sends the interrupt for this virtqueue */
+void vhost_trigger_irq(struct vhost_virtqueue *vq)
+{
+	__u16 flags = 0;
+	if (get_user(flags, &vq->avail->flags)) {
+		vq_err(vq, "Failed to get flags");
+		return;
+	}
+
+	/* If they don't want an interrupt, don't send one, unless empty. */
+	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) && vq->inflight)
+		return;
+
+	/* Send the Guest an interrupt tell them we used something up. */
+	if (vq->call_ctx)
+		eventfd_signal(vq->call_ctx, 1);
+}
+
+/* And here's the combo meal deal.  Supersize me! */
+void vhost_add_used_and_trigger(struct vhost_virtqueue *vq,
+				unsigned int head, int len)
+{
+	vhost_add_used(vq, head, len);
+	vhost_trigger_irq(vq);
+}
+
+int vhost_init(void)
+{
+	vhost_workqueue = create_workqueue("vhost");
+	if (!vhost_workqueue)
+		return -ENOMEM;
+	return 0;
+}
+
+void vhost_cleanup(void)
+{
+	destroy_workqueue(vhost_workqueue);
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
new file mode 100644
index 0000000..7f7ffcd
--- /dev/null
+++ b/drivers/vhost/vhost.h
@@ -0,0 +1,108 @@ 
+#ifndef _VHOST_H
+#define _VHOST_H
+
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/skbuff.h>
+
+struct vhost_device;
+
+enum {
+	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
+};
+
+/* Poll a file (eventfd or socket) */
+/* Note: there's nothing vhost specific about this structure. */
+struct vhost_poll {
+	poll_table                table;
+	wait_queue_head_t        *wqh;
+	wait_queue_t              wait;
+	/* struct which will handle all actual work. */
+	struct work_struct        work;
+	unsigned long		  mask;
+};
+
+void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
+		     unsigned long mask);
+void vhost_poll_start(struct vhost_poll *poll, struct file *file);
+void vhost_poll_stop(struct vhost_poll *poll);
+void vhost_poll_flush(struct vhost_poll *poll);
+
+/* The virtqueue structure describes a queue attached to a device. */
+struct vhost_virtqueue {
+	struct vhost_dev *dev;
+
+	/* The actual ring of buffers. */
+	struct mutex mutex;
+	unsigned int num;
+	struct vring_desc __user *desc;
+	struct vring_avail __user *avail;
+	struct vring_used __user *used;
+	struct file *kick;
+	struct file *call;
+	struct file *error;
+	struct eventfd_ctx *call_ctx;
+	struct eventfd_ctx *error_ctx;
+
+	struct vhost_poll poll;
+
+	/* The routine to call when the Guest pings us, or timeout. */
+	work_func_t handle_kick;
+
+	/* Last available index we saw. */
+	u16 last_avail_idx;
+
+	/* Last index we used. */
+	u16 last_used_idx;
+
+	/* Outstanding buffers */
+	unsigned int inflight;
+
+	/* Is this blocked? */
+	bool blocked;
+
+	struct iovec iov[VHOST_NET_MAX_SG];
+
+} ____cacheline_aligned;
+
+struct vhost_dev {
+	/* Readers use RCU to access memory table pointer.
+	 * Writers use mutex below.*/
+	struct vhost_memory *memory;
+	struct mm_struct *mm;
+	struct vhost_virtqueue *vqs;
+	int nvqs;
+	struct mutex mutex;
+};
+
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+long vhost_dev_check_owner(struct vhost_dev *);
+long vhost_dev_reset_owner(struct vhost_dev *);
+void vhost_dev_cleanup(struct vhost_dev *);
+long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
+
+unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			   struct iovec iov[],
+			   unsigned int *out_num, unsigned int *in_num);
+void vhost_discard_vq_desc(struct vhost_virtqueue *);
+
+int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
+void vhost_trigger_irq(struct vhost_virtqueue *);
+void vhost_add_used_and_trigger(struct vhost_virtqueue *,
+				unsigned int head, int len);
+
+int vhost_init(void);
+void vhost_cleanup(void);
+
+#define vq_err(vq, fmt, ...) do {                                  \
+		printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__);       \
+		if ((vq)->error_ctx)                               \
+				eventfd_signal((vq)->error_ctx, 1);\
+	} while (0)
+
+#endif
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index dec2f18..975df9a 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -360,6 +360,7 @@  unifdef-y += uio.h
 unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
+unifdef-y += vhost.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 0521177..781a8bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@ 
 #define HPET_MINOR		228
 #define FUSE_MINOR		229
 #define KVM_MINOR		232
+#define VHOST_NET_MINOR		233
 #define MISC_DYNAMIC_MINOR	255
 
 struct device;
diff --git a/include/linux/vhost.h b/include/linux/vhost.h
new file mode 100644
index 0000000..9ec6d5f
--- /dev/null
+++ b/include/linux/vhost.h
@@ -0,0 +1,100 @@ 
+#ifndef _LINUX_VHOST_H
+#define _LINUX_VHOST_H
+/* Userspace interface for in-kernel virtio accelerators. */
+
+/* vhost is used to reduce the number of system calls involved in virtio.
+ *
+ * Existing virtio net code is used in the guest without modification.
+ *
+ * This header includes interface used by userspace hypervisor for
+ * device configuration.
+ */
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/ioctl.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ring.h>
+
+struct vhost_vring_state {
+	unsigned int index;
+	unsigned int num;
+};
+
+struct vhost_vring_file {
+	unsigned int index;
+	int fd;
+};
+
+struct vhost_vring_addr {
+	unsigned int index;
+	unsigned int padding;
+	__u64 user_addr;
+};
+
+struct vhost_memory_region {
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr;
+	__u64 padding; /* read/write protection? */
+};
+
+struct vhost_memory {
+	__u32 nregions;
+	__u32 padding;
+	struct vhost_memory_region regions[0];
+};
+
+/* ioctls */
+
+#define VHOST_VIRTIO 0xAF
+
+/* Features bitmask for forward compatibility. Transport bits must be zero. */
+#define VHOST_GET_FEATURES	_IOR(VHOST_VIRTIO, 0x00, __u32)
+#define VHOST_ACK_FEATURES	_IOW(VHOST_VIRTIO, 0x00, __u32)
+
+/* Set current process as the (exclusive) owner of this file descriptor.  This
+ * must be called before any other vhost command.  Further calls to
+ * VHOST_OWNER_SET fail until VHOST_OWNER_RESET is called. */
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+/* Give up ownership, and reset the device to default values.
+ * Allows subsequent call to VHOST_OWNER_SET to succeed. */
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+
+/* Set up/modify memory layout */
+#define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
+
+/* Ring setup. These parameters can not be modified while ring is running
+ * (bound to a device). */
+/* Set number of descriptors in ring */
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+/* Start of array of descriptors (virtually contiguous) */
+#define VHOST_SET_VRING_DESC _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+/* Used structure address */
+#define VHOST_SET_VRING_USED _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_addr)
+/* Available structure address */
+#define VHOST_SET_VRING_AVAIL _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_addr)
+/* Base value where queue looks for available descriptors */
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+/* Get accessor: reads index, writes value in num */
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+
+/* The following ioctls use eventfd file descriptors to signal and poll
+ * for events. */
+
+/* Set eventfd to poll for added buffers */
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+/* Set eventfd to signal when buffers have beed used */
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+/* Set eventfd to signal an error */
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+
+/* VHOST_NET specific defines */
+
+/* Attach virtio net device to a raw socket. The socket must be already
+ * bound to an ethernet device, this device will be used for transmit.
+ * Pass -1 to unbind from the socket and the transmit device.
+ * This can be used to stop the device (e.g. for migration). */
+#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, int)
+
+#endif