diff mbox

[v10] kvm: add support for irqfd

Message ID 20090520142234.22285.72274.stgit@dev.haskins.net (mailing list archive)
State New, archived
Headers show

Commit Message

Gregory Haskins May 20, 2009, 2:30 p.m. UTC
(Applies to kvm.git/queue:fd2e987d)

KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure.  This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism:  Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.

This has been unit-tested with an updated version of my test harness, which
you can download here:

ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2

The test verifies both assign and deassign paths, and they appear to work
as intended.

The cooresponding userspace patches from v8 are unchanged, which you can find
here:

http://www.mail-archive.com/kvm@vger.kernel.org/msg14913.html

[ Changelog:

   v10:
	*) Fixed formatting/consistency issue in irqfd_remove
	*) Fixed return value error in deassign
	*) Fixed grammatical errors in comments
	*) Rebased to kvm.git/queue branch

   v9:
        *) Fixed a bug in deassign where we could deadlock with the way
           flush_work was being used (Thanks to Marcelo Tosatti's for spotting
           this bug).
        *) Rebased to kvm.git:2ffc3882

   v8:
	*) Re-seperated irqfd and iofd (now called iosignalfd) into two 
  	   distinct series.
        *) We compare both the fd/file and gsi on deassign
        *) De-assign is exhaustive (to support multiple associations in the
           future)
        *) s/KVM_CAP_EVENTFD/KVM_CAP_IRQFD

   v7:
        *) Added "iofd" to allow PIO/MMIO writes to generate an eventfd
           signal.  This was previously discussed as "hypercallfd", but
           since explicit hypercalls are not looking to be very popular,
           and based on the fact that they were not going to carry payload
           anyway, I named them "iofd".
        *) Generalized some of the code so that irqfd and iofd could be
           logically grouped together.  For instance
           s/KVM_CAP_IRQFD/KVM_CAP_EVENTFD and virt/kvm/irqfd.c becomes
	   virt/kvm/eventfd.c
        *) Added support for "deassign" operations to ensure we can properly
           support hot-unplug.
	*) Reinstated the eventfd EXPORT_SYMBOL patch since we need it again
           for supporting iofd.
        *) Rebased to kvm.git:b5e725fa

   v6:
        *) Moved eventfd creation back to userspace, per Avi's request
        *) Dropped no longer necessary supporting patches from series
        *) Rebased to kvm.git:833367b57

   v5:
        *) Added padding to the ioctl structure
        *) Added proper ref-count increment to the file before returning
           success. (Needs review by Al Viro, Davide Libenzi)
	*) Cleaned up error-handling path to make sure we remove ourself
	   from the waitq if necessary.
        *) Make sure we only add ourselves to kvm->irqfds if successful
           creating the irqfd in the first place.
	*) Rebased to kvm.git:66b0aed4

   v4:
        *) Changed allocation model to create the new fd last, after
           we get past the last potential error point by using Davide's
           new eventfd_file_create interface (Al Viro, Davide Libenzi)
	*) We no longer export sys_eventfd2() since it is replaced
           functionally with eventfd_file_create();
        *) Rebased to kvm.git:7da2e3ba

   v3:
        *) The kernel now allocates the eventfd (need to export sys_eventfd2)
        *) Added a flags field for future expansion to kvm_irqfd()
        *) We properly toggle the irq level 1+0.
        *) We re-use the USERSPACE_SRC_ID instead of creating our own
	*) Properly check for failures establishing a poll-table with eventfd
	*) Fixed fd/file leaks on failure
	*) Rebased to lateste kvm.git::41b76d8d04

   v2:
	*) Dropped notifier_chain based callbacks in favor of
	   wait_queue_t::func and file::poll based callbacks (Thanks to
	   Davide for the suggestion)

   v1:
        *) Initial release

]

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/kvm/Makefile    |    2 
 arch/x86/kvm/x86.c       |    1 
 include/linux/kvm.h      |   11 ++
 include/linux/kvm_host.h |    4 +
 virt/kvm/eventfd.c       |  228 ++++++++++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      |   11 ++
 6 files changed, 256 insertions(+), 1 deletions(-)
 create mode 100644 virt/kvm/eventfd.c


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Michael S. Tsirkin May 27, 2009, 1:55 p.m. UTC | #1
On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
> +static int
> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> +{
> +	struct _irqfd *irqfd;
> +	struct file *file = NULL;
> +	int ret;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->kvm = kvm;
> +	irqfd->gsi = gsi;
> +	INIT_LIST_HEAD(&irqfd->list);
> +	INIT_WORK(&irqfd->work, irqfd_inject);
> +
> +	/*
> +	 * Embed the file* lifetime in the irqfd.
> +	 */
> +	file = fget(fd);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto fail;
> +	}

So we get a reference to a file, and unless the user is nice to us, it
will only be dropped when kvm char device file is closed?
I think this will deadlock if the fd in question is the open kvm char device.
Gregory Haskins May 27, 2009, 2:06 p.m. UTC | #2
Michael S. Tsirkin wrote:
> On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
>   
>> +static int
>> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>> +{
>> +	struct _irqfd *irqfd;
>> +	struct file *file = NULL;
>> +	int ret;
>> +
>> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>> +	if (!irqfd)
>> +		return -ENOMEM;
>> +
>> +	irqfd->kvm = kvm;
>> +	irqfd->gsi = gsi;
>> +	INIT_LIST_HEAD(&irqfd->list);
>> +	INIT_WORK(&irqfd->work, irqfd_inject);
>> +
>> +	/*
>> +	 * Embed the file* lifetime in the irqfd.
>> +	 */
>> +	file = fget(fd);
>> +	if (IS_ERR(file)) {
>> +		ret = PTR_ERR(file);
>> +		goto fail;
>> +	}
>>     
>
> So we get a reference to a file, and unless the user is nice to us, it
> will only be dropped when kvm char device file is closed?
> I think this will deadlock if the fd in question is the open kvm char device.
>
>
>   
Hmm...I hadn't considered this possibility, though I am not sure if it
would cause a deadlock in the pattern you suggest.  It seems more like
it would result in, at worst, an extra reference to itself (and thus a
leak) rather than a deadlock...

I digress.  In either case, perhaps I should s/fget/eventfd_fget to at
least limit the type of fd to eventfd.  I was trying to be "slick" by
not needing the eventfd_fget() exported, but I am going to need to
export it later anyway for iosignalfd, so its probably a moot point.

Thanks Michael,
-Greg
Gregory Haskins May 27, 2009, 2:36 p.m. UTC | #3
Michael Tsirkin pointed out an issue in kvm.git w.r.t. malicious userspace
configuring irqfd objects:

http://lkml.org/lkml/2009/5/27/341

This series applies to kvm.git/master:1a6a35a1 to attempt to close the
vulnerability.

---

Gregory Haskins (2):
      kvm: validate irqfd type
      eventfd: export eventfd interfaces for module use


 fs/eventfd.c       |    3 +++
 virt/kvm/eventfd.c |    3 ++-
 2 files changed, 5 insertions(+), 1 deletions(-)
Gregory Haskins May 27, 2009, 3:06 p.m. UTC | #4
Gregory Haskins wrote:
> Michael Tsirkin pointed out an issue in kvm.git w.r.t. malicious userspace
> configuring irqfd objects:
>
> http://lkml.org/lkml/2009/5/27/341
>
> This series applies to kvm.git/master:1a6a35a1 to attempt to close the
> vulnerability.
>   

Avi,

FYI: This is also available as a git-pull:

git pull
git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
for-avi

> ---
>
> Gregory Haskins (2):
>       kvm: validate irqfd type
>       eventfd: export eventfd interfaces for module use
>
>
>  fs/eventfd.c       |    3 +++
>  virt/kvm/eventfd.c |    3 ++-
>  2 files changed, 5 insertions(+), 1 deletions(-)
>
>
Michael S. Tsirkin May 27, 2009, 6:41 p.m. UTC | #5
On Wed, May 27, 2009 at 10:06:50AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
> >   
> >> +static int
> >> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >> +{
> >> +	struct _irqfd *irqfd;
> >> +	struct file *file = NULL;
> >> +	int ret;
> >> +
> >> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> >> +	if (!irqfd)
> >> +		return -ENOMEM;
> >> +
> >> +	irqfd->kvm = kvm;
> >> +	irqfd->gsi = gsi;
> >> +	INIT_LIST_HEAD(&irqfd->list);
> >> +	INIT_WORK(&irqfd->work, irqfd_inject);
> >> +
> >> +	/*
> >> +	 * Embed the file* lifetime in the irqfd.
> >> +	 */
> >> +	file = fget(fd);
> >> +	if (IS_ERR(file)) {
> >> +		ret = PTR_ERR(file);
> >> +		goto fail;
> >> +	}
> >>     
> >
> > So we get a reference to a file, and unless the user is nice to us, it
> > will only be dropped when kvm char device file is closed?
> > I think this will deadlock if the fd in question is the open kvm char device.
> >
> >
> >   
> Hmm...I hadn't considered this possibility, though I am not sure if it
> would cause a deadlock in the pattern you suggest.  It seems more like
> it would result in, at worst, an extra reference to itself (and thus a
> leak) rather than a deadlock...
> 
> I digress.  In either case, perhaps I should s/fget/eventfd_fget to at
> least limit the type of fd to eventfd.  I was trying to be "slick" by
> not needing the eventfd_fget() exported, but I am going to need to
> export it later anyway for iosignalfd, so its probably a moot point.
> 
> Thanks Michael,
> -Greg
> 

This only works as long as eventfd does not do fget on some fd as well.
Which it does not do now, and may never do - but we create a fragile
system this way.

I think it's really wrong, fundamentally, to keep a reference to a
file until another file is closed, unless you are code under fs/.
We will get nasty circular references sooner or later.

Isn't the real reason we use fd to be able to support the same interface
on top of both kvm and lguest?
And if so, wouldn't some kind of bus be a better solution?
Gregory Haskins May 27, 2009, 8:07 p.m. UTC | #6
Michael S. Tsirkin wrote:
> On Wed, May 27, 2009 at 10:06:50AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> +static int
>>>> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>>>> +{
>>>> +	struct _irqfd *irqfd;
>>>> +	struct file *file = NULL;
>>>> +	int ret;
>>>> +
>>>> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>>>> +	if (!irqfd)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	irqfd->kvm = kvm;
>>>> +	irqfd->gsi = gsi;
>>>> +	INIT_LIST_HEAD(&irqfd->list);
>>>> +	INIT_WORK(&irqfd->work, irqfd_inject);
>>>> +
>>>> +	/*
>>>> +	 * Embed the file* lifetime in the irqfd.
>>>> +	 */
>>>> +	file = fget(fd);
>>>> +	if (IS_ERR(file)) {
>>>> +		ret = PTR_ERR(file);
>>>> +		goto fail;
>>>> +	}
>>>>     
>>>>         
>>> So we get a reference to a file, and unless the user is nice to us, it
>>> will only be dropped when kvm char device file is closed?
>>> I think this will deadlock if the fd in question is the open kvm char device.
>>>
>>>
>>>   
>>>       
>> Hmm...I hadn't considered this possibility, though I am not sure if it
>> would cause a deadlock in the pattern you suggest.  It seems more like
>> it would result in, at worst, an extra reference to itself (and thus a
>> leak) rather than a deadlock...
>>
>> I digress.  In either case, perhaps I should s/fget/eventfd_fget to at
>> least limit the type of fd to eventfd.  I was trying to be "slick" by
>> not needing the eventfd_fget() exported, but I am going to need to
>> export it later anyway for iosignalfd, so its probably a moot point.
>>
>> Thanks Michael,
>> -Greg
>>
>>     
>
> This only works as long as eventfd does not do fget on some fd as well.
> Which it does not do now, and may never do - but we create a fragile
> system this way.
>
> I think it's really wrong, fundamentally, to keep a reference to a
> file until another file is closed, unless you are code under fs/.
> We will get nasty circular references sooner or later.
>   

Hmm.. I understand your concern, but I respectfully disagree.

One object referencing another is a natural expression, regardless of
what type they may be.  The fact is that introducing the concept of
irqfd creates a relationship between an eventfd instance and a kvm
instance whether we like it or not, and this relationship needs to be
managed.  It is therefore IMO perfectly natural to express that
relationship with a reference count, and I do not currently see anything
wrong or even particularly fragile about how I've currently done this. 
I'm sure there are other ways, however.  Do you have a particular
suggestion in mind?

> Isn't the real reason we use fd to be able to support the same interface
> on top of both kvm and lguest?
>   

Actually, the reason why we use an fd is to decouple the
interrupt-producing end-point from the KVM core.  Ignoring eventfd in
specific for a moment, one convenient way to do that is with an fd
because it provides a nice, already written/tested handle-to-pointer
translation, and a polymorphic interface (e.g. f_ops).  Choosing to use
eventfd flavored fd's buys us additional advantages in terms of
leveraging already tested f_ops code, and compatibility with an
interface that is designed-for/used-by other established subsystems for
signaling.
> And if so, wouldn't some kind of bus be a better solution?
>   

Ultimately I aim to implement a bus (vbus, specifically) in terms of
irqfd (and iosignalfd, for that matter).  However, the eventfd
interfaces are general purpose and can be used in other areas as well
(for instance, virtio-pci, or the shared-mem driver recently
discussed).  I realize this is probably not the point you were making
here, but fyi.

Regards,
-Greg
Michael S. Tsirkin May 27, 2009, 8:43 p.m. UTC | #7
On Wed, May 27, 2009 at 04:07:23PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Wed, May 27, 2009 at 10:06:50AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>> On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
> >>>   
> >>>       
> >>>> +static int
> >>>> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >>>> +{
> >>>> +	struct _irqfd *irqfd;
> >>>> +	struct file *file = NULL;
> >>>> +	int ret;
> >>>> +
> >>>> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> >>>> +	if (!irqfd)
> >>>> +		return -ENOMEM;
> >>>> +
> >>>> +	irqfd->kvm = kvm;
> >>>> +	irqfd->gsi = gsi;
> >>>> +	INIT_LIST_HEAD(&irqfd->list);
> >>>> +	INIT_WORK(&irqfd->work, irqfd_inject);
> >>>> +
> >>>> +	/*
> >>>> +	 * Embed the file* lifetime in the irqfd.
> >>>> +	 */
> >>>> +	file = fget(fd);
> >>>> +	if (IS_ERR(file)) {
> >>>> +		ret = PTR_ERR(file);
> >>>> +		goto fail;
> >>>> +	}
> >>>>     
> >>>>         
> >>> So we get a reference to a file, and unless the user is nice to us, it
> >>> will only be dropped when kvm char device file is closed?
> >>> I think this will deadlock if the fd in question is the open kvm char device.
> >>>
> >>>
> >>>   
> >>>       
> >> Hmm...I hadn't considered this possibility, though I am not sure if it
> >> would cause a deadlock in the pattern you suggest.  It seems more like
> >> it would result in, at worst, an extra reference to itself (and thus a
> >> leak) rather than a deadlock...
> >>
> >> I digress.  In either case, perhaps I should s/fget/eventfd_fget to at
> >> least limit the type of fd to eventfd.  I was trying to be "slick" by
> >> not needing the eventfd_fget() exported, but I am going to need to
> >> export it later anyway for iosignalfd, so its probably a moot point.
> >>
> >> Thanks Michael,
> >> -Greg
> >>
> >>     
> >
> > This only works as long as eventfd does not do fget on some fd as well.
> > Which it does not do now, and may never do - but we create a fragile
> > system this way.
> >
> > I think it's really wrong, fundamentally, to keep a reference to a
> > file until another file is closed, unless you are code under fs/.
> > We will get nasty circular references sooner or later.
> >   
> 
> Hmm.. I understand your concern, but I respectfully disagree.
> 
> One object referencing another is a natural expression, regardless of
> what type they may be.  The fact is that introducing the concept of
> irqfd creates a relationship between an eventfd instance and a kvm
> instance whether we like it or not, and this relationship needs to be
> managed.  It is therefore IMO perfectly natural to express that
> relationship with a reference count, and I do not currently see anything
> wrong or even particularly fragile about how I've currently done this. 
> I'm sure there are other ways, however.  Do you have a particular
> suggestion in mind?

Yes. I'll try to post a patch soon.

> > Isn't the real reason we use fd to be able to support the same interface
> > on top of both kvm and lguest?
> >   
> 
> Actually, the reason why we use an fd is to decouple the
> interrupt-producing end-point from the KVM core.  Ignoring eventfd in
> specific for a moment, one convenient way to do that is with an fd
> because it provides a nice, already written/tested handle-to-pointer
> translation, and a polymorphic interface (e.g. f_ops).  Choosing to use
> eventfd flavored fd's buys us additional advantages in terms of
> leveraging already tested f_ops code, and compatibility with an
> interface that is designed-for/used-by other established subsystems for
> signaling.
> > And if so, wouldn't some kind of bus be a better solution?
> >   
> 
> Ultimately I aim to implement a bus (vbus, specifically) in terms of
> irqfd (and iosignalfd, for that matter).  However, the eventfd
> interfaces are general purpose and can be used in other areas as well
> (for instance, virtio-pci, or the shared-mem driver recently
> discussed).  I realize this is probably not the point you were making
> here, but fyi.
> 
> Regards,
> -Greg
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins May 27, 2009, 8:46 p.m. UTC | #8
Michael S. Tsirkin wrote:
> On Wed, May 27, 2009 at 04:07:23PM -0400, Gregory Haskins wrote:
>   
>> Do you have a particular suggestion in mind?
>>     
>
> Yes. I'll try to post a patch soon.
>
>   

Sounds good.  Thanks Michael.

-Greg
Avi Kivity May 31, 2009, 9:36 a.m. UTC | #9
Gregory Haskins wrote:
> Michael Tsirkin pointed out an issue in kvm.git w.r.t. malicious userspace
> configuring irqfd objects:
>
> http://lkml.org/lkml/2009/5/27/341
>
> This series applies to kvm.git/master:1a6a35a1 to attempt to close the
> vulnerability.
>
>   

Applied, thanks.
Michael S. Tsirkin June 11, 2009, 1:16 p.m. UTC | #10
Going over this code again, I seem to see a minor error handling issue here:

On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> new file mode 100644
> index 0000000..72a282e
> --- /dev/null
> +++ b/virt/kvm/eventfd.c
> @@ -0,0 +1,228 @@
> +/*
> + * kvm eventfd support - use eventfd objects to signal various KVM events
> + *
> + * Copyright 2009 Novell.  All Rights Reserved.
> + *
> + * Author:
> + *	Gregory Haskins <ghaskins@novell.com>
> + *
> + * This file is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software Foundation,
> + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/workqueue.h>
> +#include <linux/syscalls.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +
> +/*
> + * --------------------------------------------------------------------
> + * irqfd: Allows an fd to be used to inject an interrupt to the guest
> + *
> + * Credit goes to Avi Kivity for the original idea.
> + * --------------------------------------------------------------------
> + */
> +struct _irqfd {
> +	struct kvm               *kvm;
> +	int                       gsi;
> +	struct file              *file;
> +	struct list_head          list;
> +	poll_table                pt;
> +	wait_queue_head_t        *wqh;
> +	wait_queue_t              wait;
> +	struct work_struct        work;
> +};
> +
> +static void
> +irqfd_inject(struct work_struct *work)
> +{
> +	struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
> +	struct kvm *kvm = irqfd->kvm;
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> +	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> +	mutex_unlock(&kvm->lock);
> +}
> +
> +static int
> +irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> +	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> +
> +	/*
> +	 * The wake_up is called with interrupts disabled.  Therefore we need
> +	 * to defer the IRQ injection until later since we need to acquire the
> +	 * kvm->lock to do so.
> +	 */
> +	schedule_work(&irqfd->work);
> +
> +	return 0;
> +}
> +
> +static void
> +irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> +			poll_table *pt)
> +{
> +	struct _irqfd *irqfd = container_of(pt, struct _irqfd, pt);
> +
> +	irqfd->wqh = wqh;
> +	add_wait_queue(wqh, &irqfd->wait);
> +}
> +
> +static int
> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> +{
> +	struct _irqfd *irqfd;
> +	struct file *file = NULL;
> +	int ret;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->kvm = kvm;
> +	irqfd->gsi = gsi;
> +	INIT_LIST_HEAD(&irqfd->list);
> +	INIT_WORK(&irqfd->work, irqfd_inject);
> +
> +	/*
> +	 * Embed the file* lifetime in the irqfd.
> +	 */
> +	file = fget(fd);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto fail;
> +	}
> +
> +	/*
> +	 * Install our own custom wake-up handling so we are notified via
> +	 * a callback whenever someone signals the underlying eventfd
> +	 */
> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> +
> +	ret = file->f_op->poll(file, &irqfd->pt);
> +	if (ret < 0)
> +		goto fail;
> +
> +	irqfd->file = file;
> +
> +	mutex_lock(&kvm->lock);
> +	list_add_tail(&irqfd->list, &kvm->irqfds);
> +	mutex_unlock(&kvm->lock);
> +
> +	return 0;
> +
> +fail:
> +	if (irqfd->wqh)
> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);

Why are these 2 lines here? Either we might get a callback even though
poll failed - and then this test without lock is probably racy -
or we can't, and then we can replace the above with BUG_ON(irqfd->wqh).

Which is it? I think the later ...


> +
> +	if (file && !IS_ERR(file))
> +		fput(file);
> +
> +	kfree(irqfd);
> +	return ret;
> +}
> +
> +static void
> +irqfd_release(struct _irqfd *irqfd)
> +{
> +	/*
> +	 * The ordering is important.  We must remove ourselves from the wqh
> +	 * first to ensure no more event callbacks are issued, and then flush
> +	 * any previously scheduled work prior to freeing the memory
> +	 */
> +	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> +	flush_work(&irqfd->work);
> +
> +	fput(irqfd->file);
> +	kfree(irqfd);
> +}
> +
> +static struct _irqfd *
> +irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
> +{
> +	struct _irqfd *irqfd;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	/*
> +	 * linear search isn't brilliant, but this should be an infrequent
> +	 * slow-path operation, and the list should not grow very large
> +	 */
> +	list_for_each_entry(irqfd, &kvm->irqfds, list) {
> +		if (irqfd->file != file || irqfd->gsi != gsi)
> +			continue;
> +
> +		list_del(&irqfd->list);
> +		mutex_unlock(&kvm->lock);
> +
> +		return irqfd;
> +	}
> +
> +	mutex_unlock(&kvm->lock);
> +
> +	return NULL;
> +}
> +
> +static int
> +kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
> +{
> +	struct _irqfd *irqfd;
> +	struct file *file;
> +	int count = 0;
> +
> +	file = fget(fd);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	while ((irqfd = irqfd_remove(kvm, file, gsi))) {
> +		/*
> +		 * We remove the item from the list under the lock, but we
> +		 * free it outside the lock to avoid deadlocking with the
> +		 * flush_work and the work_item taking the lock
> +		 */
> +		irqfd_release(irqfd);
> +		count++;
> +	}
> +
> +	fput(file);
> +
> +	return count ? count : -ENOENT;
> +}
> +
> +int
> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +{
> +	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> +		return kvm_deassign_irqfd(kvm, fd, gsi);
> +
> +	return kvm_assign_irqfd(kvm, fd, gsi);
> +}
> +
> +void
> +kvm_irqfd_release(struct kvm *kvm)
> +{
> +	struct _irqfd *irqfd, *tmp;
> +
> +	/* don't bother with the lock..we are shutting down */
> +	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> +		list_del(&irqfd->list);
> +		irqfd_release(irqfd);
> +	}
> +}
Michael S. Tsirkin June 11, 2009, 1:36 p.m. UTC | #11
On Thu, Jun 11, 2009 at 04:16:47PM +0300, Michael S. Tsirkin wrote:
> > +
> > +	ret = file->f_op->poll(file, &irqfd->pt);
> > +	if (ret < 0)
> > +		goto fail;

Looking at it some more, we have:
struct file_operations {
....
	unsigned int (*poll) (struct file *, struct poll_table_struct *);

So the comparison above does not seem to make sense:
it seems that the return value from poll can not be negative.

Will the callback be executed if someone did a write to eventfd
before we attached it? If no, maybe we should call it here
if ret != 0.


> > +
> > +	irqfd->file = file;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	list_add_tail(&irqfd->list, &kvm->irqfds);
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	return 0;
> > +
> > +fail:
> > +	if (irqfd->wqh)
> > +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> Why are these 2 lines here? Either we might get a callback even though
> poll failed - and then this test without lock is probably racy -
> or we can't, and then we can replace the above with BUG_ON(irqfd->wqh).
> 
> Which is it? I think the later ...
> 
> 
> > +
> > +	if (file && !IS_ERR(file))
> > +		fput(file);
> > +
> > +	kfree(irqfd);
> > +	return ret;
> > +}
> > +
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 14, 2009, 9:25 a.m. UTC | #12
On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:

...

> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> +static int
> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> +{
> +	struct _irqfd *irqfd;
> +	struct file *file = NULL;
> +	int ret;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->kvm = kvm;
> +	irqfd->gsi = gsi;
> +	INIT_LIST_HEAD(&irqfd->list);
> +	INIT_WORK(&irqfd->work, irqfd_inject);
> +
> +	/*
> +	 * Embed the file* lifetime in the irqfd.
> +	 */
> +	file = fget(fd);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto fail;
> +	}
> +
> +	/*
> +	 * Install our own custom wake-up handling so we are notified via
> +	 * a callback whenever someone signals the underlying eventfd
> +	 */
> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> +
> +	ret = file->f_op->poll(file, &irqfd->pt);
> +	if (ret < 0)
> +		goto fail;
> +
> +	irqfd->file = file;
> +
> +	mutex_lock(&kvm->lock);
> +	list_add_tail(&irqfd->list, &kvm->irqfds);
> +	mutex_unlock(&kvm->lock);
> +
> +	return 0;
> +
> +fail:
> +	if (irqfd->wqh)
> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> +	if (file && !IS_ERR(file))
> +		fput(file);
> +
> +	kfree(irqfd);
> +	return ret;
> +}

It seems that this lets the guest assign an unlimited number of fds
to the same gsi, potentially using up all of kernel memory.

Since we don't need multiple fds assigned to the same gsi (instead,
multiple processes can write to the same eventfd to trigger an
interrupt) let's simply check that no fd is yet assigned to this gsi.

Makes sense?
Gregory Haskins June 14, 2009, 12:25 p.m. UTC | #13
Michael S. Tsirkin wrote:
> On Thu, Jun 11, 2009 at 04:16:47PM +0300, Michael S. Tsirkin wrote:
>   
>>> +
>>> +	ret = file->f_op->poll(file, &irqfd->pt);
>>> +	if (ret < 0)
>>> +		goto fail;
>>>       
>
> Looking at it some more, we have:
> struct file_operations {
> ....
> 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
>
> So the comparison above does not seem to make sense:
> it seems that the return value from poll can not be negative.
>   

Indeed.  Will fix.
> Will the callback be executed if someone did a write to eventfd
> before we attached it? If no, maybe we should call it here
> if ret != 0.
>   

I do the cleanup in case the callback has been called, but poll() fails
somewhere internally afterwards.  Perhaps this is not a realistic
scenario, but it was my motivation for adding the wqh cleanup.
>
>   
>>> +
>>> +	irqfd->file = file;
>>> +
>>> +	mutex_lock(&kvm->lock);
>>> +	list_add_tail(&irqfd->list, &kvm->irqfds);
>>> +	mutex_unlock(&kvm->lock);
>>> +
>>> +	return 0;
>>> +
>>> +fail:
>>> +	if (irqfd->wqh)
>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>       
>> Why are these 2 lines here? Either we might get a callback even though
>> poll failed - and then this test without lock is probably racy -
>> or we can't, and then we can replace the above with BUG_ON(irqfd->wqh).
>>
>> Which is it? I think the later ...
>>
>>
>>     
>>> +
>>> +	if (file && !IS_ERR(file))
>>> +		fput(file);
>>> +
>>> +	kfree(irqfd);
>>> +	return ret;
>>> +}
>>> +
>>>
Gregory Haskins June 14, 2009, 12:40 p.m. UTC | #14
Michael S. Tsirkin wrote:
> On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
>
> ...
>
>   
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> +static int
>> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>> +{
>> +	struct _irqfd *irqfd;
>> +	struct file *file = NULL;
>> +	int ret;
>> +
>> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>> +	if (!irqfd)
>> +		return -ENOMEM;
>> +
>> +	irqfd->kvm = kvm;
>> +	irqfd->gsi = gsi;
>> +	INIT_LIST_HEAD(&irqfd->list);
>> +	INIT_WORK(&irqfd->work, irqfd_inject);
>> +
>> +	/*
>> +	 * Embed the file* lifetime in the irqfd.
>> +	 */
>> +	file = fget(fd);
>> +	if (IS_ERR(file)) {
>> +		ret = PTR_ERR(file);
>> +		goto fail;
>> +	}
>> +
>> +	/*
>> +	 * Install our own custom wake-up handling so we are notified via
>> +	 * a callback whenever someone signals the underlying eventfd
>> +	 */
>> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
>> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
>> +
>> +	ret = file->f_op->poll(file, &irqfd->pt);
>> +	if (ret < 0)
>> +		goto fail;
>> +
>> +	irqfd->file = file;
>> +
>> +	mutex_lock(&kvm->lock);
>> +	list_add_tail(&irqfd->list, &kvm->irqfds);
>> +	mutex_unlock(&kvm->lock);
>> +
>> +	return 0;
>> +
>> +fail:
>> +	if (irqfd->wqh)
>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> +
>> +	if (file && !IS_ERR(file))
>> +		fput(file);
>> +
>> +	kfree(irqfd);
>> +	return ret;
>> +}
>>     
>
> It seems that this lets the guest assign an unlimited number of fds
> to the same gsi, potentially using up all of kernel memory.
>
> Since we don't need multiple fds assigned to the same gsi (instead,
> multiple processes can write to the same eventfd to trigger an
> interrupt) let's simply check that no fd is yet assigned to this gsi.
>   

I think Avi asked for this specific feature during review which is the
reason why its there today.  However, I agree that it would probably be
a good idea to put an upper limit on the number of supported aliases
that can be registered.  Will fix.

Thanks Michael,

-Greg
Michael S. Tsirkin June 14, 2009, 1:19 p.m. UTC | #15
On Sun, Jun 14, 2009 at 08:40:57AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
> >
> > ...
> >
> >   
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> +static int
> >> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >> +{
> >> +	struct _irqfd *irqfd;
> >> +	struct file *file = NULL;
> >> +	int ret;
> >> +
> >> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> >> +	if (!irqfd)
> >> +		return -ENOMEM;
> >> +
> >> +	irqfd->kvm = kvm;
> >> +	irqfd->gsi = gsi;
> >> +	INIT_LIST_HEAD(&irqfd->list);
> >> +	INIT_WORK(&irqfd->work, irqfd_inject);
> >> +
> >> +	/*
> >> +	 * Embed the file* lifetime in the irqfd.
> >> +	 */
> >> +	file = fget(fd);
> >> +	if (IS_ERR(file)) {
> >> +		ret = PTR_ERR(file);
> >> +		goto fail;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Install our own custom wake-up handling so we are notified via
> >> +	 * a callback whenever someone signals the underlying eventfd
> >> +	 */
> >> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> >> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> >> +
> >> +	ret = file->f_op->poll(file, &irqfd->pt);
> >> +	if (ret < 0)
> >> +		goto fail;
> >> +
> >> +	irqfd->file = file;
> >> +
> >> +	mutex_lock(&kvm->lock);
> >> +	list_add_tail(&irqfd->list, &kvm->irqfds);
> >> +	mutex_unlock(&kvm->lock);
> >> +
> >> +	return 0;
> >> +
> >> +fail:
> >> +	if (irqfd->wqh)
> >> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> +
> >> +	if (file && !IS_ERR(file))
> >> +		fput(file);
> >> +
> >> +	kfree(irqfd);
> >> +	return ret;
> >> +}
> >>     
> >
> > It seems that this lets the guest assign an unlimited number of fds
> > to the same gsi, potentially using up all of kernel memory.
> >
> > Since we don't need multiple fds assigned to the same gsi (instead,
> > multiple processes can write to the same eventfd to trigger an
> > interrupt) let's simply check that no fd is yet assigned to this gsi.
> >   
> 
> I think Avi asked for this specific feature during review which is the
> reason why its there today.  However, I agree that it would probably be
> a good idea to put an upper limit on the number of supported aliases
> that can be registered.  Will fix.
> 
> Thanks Michael,
> 
> -Greg
> 
> 


Avi, can you elaborate on why do we want to map multiple fds
to the same gsi? I think it's better to allow a 1:1 mapping
only: if many processes want to trigger interrupts they can
all write to the same fd.
Michael S. Tsirkin June 14, 2009, 1:20 p.m. UTC | #16
On Sun, Jun 14, 2009 at 08:25:43AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 11, 2009 at 04:16:47PM +0300, Michael S. Tsirkin wrote:
> >   
> >>> +
> >>> +	ret = file->f_op->poll(file, &irqfd->pt);
> >>> +	if (ret < 0)
> >>> +		goto fail;
> >>>       
> >
> > Looking at it some more, we have:
> > struct file_operations {
> > ....
> > 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
> >
> > So the comparison above does not seem to make sense:
> > it seems that the return value from poll can not be negative.
> >   
> 
> Indeed.  Will fix.
> > Will the callback be executed if someone did a write to eventfd
> > before we attached it? If no, maybe we should call it here
> > if ret != 0.
> >   
> 
> I do the cleanup in case the callback has been called, but poll() fails
> somewhere internally afterwards.  Perhaps this is not a realistic
> scenario, but it was my motivation for adding the wqh cleanup.

Could it be that poll returns the event mask and you mistake it for
error?

> >   
> >>> +
> >>> +	irqfd->file = file;
> >>> +
> >>> +	mutex_lock(&kvm->lock);
> >>> +	list_add_tail(&irqfd->list, &kvm->irqfds);
> >>> +	mutex_unlock(&kvm->lock);
> >>> +
> >>> +	return 0;
> >>> +
> >>> +fail:
> >>> +	if (irqfd->wqh)
> >>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >>>       
> >> Why are these 2 lines here? Either we might get a callback even though
> >> poll failed - and then this test without lock is probably racy -
> >> or we can't, and then we can replace the above with BUG_ON(irqfd->wqh).
> >>
> >> Which is it? I think the later ...
> >>
> >>
> >>     
> >>> +
> >>> +	if (file && !IS_ERR(file))
> >>> +		fput(file);
> >>> +
> >>> +	kfree(irqfd);
> >>> +	return ret;
> >>> +}
> >>> +
> >>>       
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity June 14, 2009, 1:23 p.m. UTC | #17
Michael S. Tsirkin wrote:

  

>> I think Avi asked for this specific feature during review which is the
>> reason why its there today.  However, I agree that it would probably be
>> a good idea to put an upper limit on the number of supported aliases
>> that can be registered.  Will fix.
>>
>> Thanks Michael,
>>
>> -Greg
>>
>>
>>     
>
>
> Avi, can you elaborate on why do we want to map multiple fds
> to the same gsi? I think it's better to allow a 1:1 mapping
> only: if many processes want to trigger interrupts they can
> all write to the same fd.
>   

I don't want to assume that the eventfds all come from the same source.

That said, we have a workaround, allocate a new gsi with the same routes 
and attach the excess eventfds there.
Michael S. Tsirkin June 14, 2009, 1:30 p.m. UTC | #18
On Sun, Jun 14, 2009 at 04:23:36PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>
>  
>
>>> I think Avi asked for this specific feature during review which is the
>>> reason why its there today.  However, I agree that it would probably be
>>> a good idea to put an upper limit on the number of supported aliases
>>> that can be registered.  Will fix.
>>>
>>> Thanks Michael,
>>>
>>> -Greg
>>>
>>>
>>>     
>>
>>
>> Avi, can you elaborate on why do we want to map multiple fds
>> to the same gsi? I think it's better to allow a 1:1 mapping
>> only: if many processes want to trigger interrupts they can
>> all write to the same fd.
>>   
>
> I don't want to assume that the eventfds all come from the same source.
>
> That said, we have a workaround, allocate a new gsi with the same routes  
> and attach the excess eventfds there.

Right. So you are ok with 1:1 irqfd:gsi requirement for now?
This seems nicer than N:1 with an arbitrary N.
Avi Kivity June 14, 2009, 1:40 p.m. UTC | #19
Michael S. Tsirkin wrote:

  

>> I don't want to assume that the eventfds all come from the same source.
>>
>> That said, we have a workaround, allocate a new gsi with the same routes  
>> and attach the excess eventfds there.
>>     
>
> Right. So you are ok with 1:1 irqfd:gsi requirement for now?
> This seems nicer than N:1 with an arbitrary N

Not too happy, but okay.
Michael S. Tsirkin June 14, 2009, 1:50 p.m. UTC | #20
On Sun, Jun 14, 2009 at 04:40:11PM +0300, Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>
>  
>
>>> I don't want to assume that the eventfds all come from the same source.
>>>
>>> That said, we have a workaround, allocate a new gsi with the same 
>>> routes  and attach the excess eventfds there.
>>>     
>>
>> Right. So you are ok with 1:1 irqfd:gsi requirement for now?
>> This seems nicer than N:1 with an arbitrary N
>
> Not too happy, but okay.

Is the answer 42:1 then, as usual?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index bee9512..01e3c61 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -2,7 +2,7 @@ 
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-y			+= $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
-				coalesced_mmio.o irq_comm.o)
+				coalesced_mmio.o irq_comm.o eventfd.o)
 kvm-$(CONFIG_KVM_TRACE)	+= $(addprefix ../../../virt/kvm/, kvm_trace.o)
 kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7978d32..98c2434 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1084,6 +1084,7 @@  int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_REINJECT_CONTROL:
 	case KVM_CAP_IRQ_INJECT_STATUS:
 	case KVM_CAP_ASSIGN_DEV_IRQ:
+	case KVM_CAP_IRQFD:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7b17141..8f53f24 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -418,6 +418,7 @@  struct kvm_trace_rec {
 #ifdef __KVM_HAVE_MCE
 #define KVM_CAP_MCE 31
 #endif
+#define KVM_CAP_IRQFD 32
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -470,6 +471,15 @@  struct kvm_x86_mce {
 };
 #endif
 
+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
+struct kvm_irqfd {
+	__u32 fd;
+	__u32 gsi;
+	__u32 flags;
+	__u8  pad[20];
+};
+
 /*
  * ioctls for VM fds
  */
@@ -514,6 +524,7 @@  struct kvm_x86_mce {
 #define KVM_ASSIGN_SET_MSIX_ENTRY \
 			_IOW(KVMIO, 0x74, struct kvm_assigned_msix_entry)
 #define KVM_DEASSIGN_DEV_IRQ       _IOW(KVMIO, 0x75, struct kvm_assigned_irq)
+#define KVM_IRQFD                  _IOW(KVMIO, 0x76, struct kvm_irqfd)
 
 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f410d3..3b6caf5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -134,6 +134,7 @@  struct kvm {
 	struct list_head vm_list;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	struct list_head irqfds;
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
 	atomic_t users_count;
@@ -528,4 +529,7 @@  static inline void kvm_free_irq_routing(struct kvm *kvm) {}
 
 #endif
 
+int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags);
+void kvm_irqfd_release(struct kvm *kvm);
+
 #endif
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
new file mode 100644
index 0000000..72a282e
--- /dev/null
+++ b/virt/kvm/eventfd.c
@@ -0,0 +1,228 @@ 
+/*
+ * kvm eventfd support - use eventfd objects to signal various KVM events
+ *
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/workqueue.h>
+#include <linux/syscalls.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/list.h>
+
+/*
+ * --------------------------------------------------------------------
+ * irqfd: Allows an fd to be used to inject an interrupt to the guest
+ *
+ * Credit goes to Avi Kivity for the original idea.
+ * --------------------------------------------------------------------
+ */
+struct _irqfd {
+	struct kvm               *kvm;
+	int                       gsi;
+	struct file              *file;
+	struct list_head          list;
+	poll_table                pt;
+	wait_queue_head_t        *wqh;
+	wait_queue_t              wait;
+	struct work_struct        work;
+};
+
+static void
+irqfd_inject(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
+	struct kvm *kvm = irqfd->kvm;
+
+	mutex_lock(&kvm->lock);
+	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+	mutex_unlock(&kvm->lock);
+}
+
+static int
+irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
+
+	/*
+	 * The wake_up is called with interrupts disabled.  Therefore we need
+	 * to defer the IRQ injection until later since we need to acquire the
+	 * kvm->lock to do so.
+	 */
+	schedule_work(&irqfd->work);
+
+	return 0;
+}
+
+static void
+irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
+			poll_table *pt)
+{
+	struct _irqfd *irqfd = container_of(pt, struct _irqfd, pt);
+
+	irqfd->wqh = wqh;
+	add_wait_queue(wqh, &irqfd->wait);
+}
+
+static int
+kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
+{
+	struct _irqfd *irqfd;
+	struct file *file = NULL;
+	int ret;
+
+	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
+	if (!irqfd)
+		return -ENOMEM;
+
+	irqfd->kvm = kvm;
+	irqfd->gsi = gsi;
+	INIT_LIST_HEAD(&irqfd->list);
+	INIT_WORK(&irqfd->work, irqfd_inject);
+
+	/*
+	 * Embed the file* lifetime in the irqfd.
+	 */
+	file = fget(fd);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto fail;
+	}
+
+	/*
+	 * Install our own custom wake-up handling so we are notified via
+	 * a callback whenever someone signals the underlying eventfd
+	 */
+	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
+	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
+
+	ret = file->f_op->poll(file, &irqfd->pt);
+	if (ret < 0)
+		goto fail;
+
+	irqfd->file = file;
+
+	mutex_lock(&kvm->lock);
+	list_add_tail(&irqfd->list, &kvm->irqfds);
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+
+fail:
+	if (irqfd->wqh)
+		remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+	if (file && !IS_ERR(file))
+		fput(file);
+
+	kfree(irqfd);
+	return ret;
+}
+
+static void
+irqfd_release(struct _irqfd *irqfd)
+{
+	/*
+	 * The ordering is important.  We must remove ourselves from the wqh
+	 * first to ensure no more event callbacks are issued, and then flush
+	 * any previously scheduled work prior to freeing the memory
+	 */
+	remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+	flush_work(&irqfd->work);
+
+	fput(irqfd->file);
+	kfree(irqfd);
+}
+
+static struct _irqfd *
+irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
+{
+	struct _irqfd *irqfd;
+
+	mutex_lock(&kvm->lock);
+
+	/*
+	 * linear search isn't brilliant, but this should be an infrequent
+	 * slow-path operation, and the list should not grow very large
+	 */
+	list_for_each_entry(irqfd, &kvm->irqfds, list) {
+		if (irqfd->file != file || irqfd->gsi != gsi)
+			continue;
+
+		list_del(&irqfd->list);
+		mutex_unlock(&kvm->lock);
+
+		return irqfd;
+	}
+
+	mutex_unlock(&kvm->lock);
+
+	return NULL;
+}
+
+static int
+kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
+{
+	struct _irqfd *irqfd;
+	struct file *file;
+	int count = 0;
+
+	file = fget(fd);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	while ((irqfd = irqfd_remove(kvm, file, gsi))) {
+		/*
+		 * We remove the item from the list under the lock, but we
+		 * free it outside the lock to avoid deadlocking with the
+		 * flush_work and the work_item taking the lock
+		 */
+		irqfd_release(irqfd);
+		count++;
+	}
+
+	fput(file);
+
+	return count ? count : -ENOENT;
+}
+
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+{
+	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
+		return kvm_deassign_irqfd(kvm, fd, gsi);
+
+	return kvm_assign_irqfd(kvm, fd, gsi);
+}
+
+void
+kvm_irqfd_release(struct kvm *kvm)
+{
+	struct _irqfd *irqfd, *tmp;
+
+	/* don't bother with the lock..we are shutting down */
+	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
+		list_del(&irqfd->list);
+		irqfd_release(irqfd);
+	}
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bebfe59..b58837d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -983,6 +983,7 @@  static struct kvm *kvm_create_vm(void)
 	atomic_inc(&kvm->mm->mm_count);
 	spin_lock_init(&kvm->mmu_lock);
 	kvm_io_bus_init(&kvm->pio_bus);
+	INIT_LIST_HEAD(&kvm->irqfds);
 	mutex_init(&kvm->lock);
 	kvm_io_bus_init(&kvm->mmio_bus);
 	init_rwsem(&kvm->slots_lock);
@@ -1034,6 +1035,7 @@  static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	kvm_irqfd_release(kvm);
 	kvm_free_irq_routing(kvm);
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
@@ -2210,6 +2212,15 @@  static long kvm_vm_ioctl(struct file *filp,
 	}
 #endif
 #endif /* KVM_CAP_IRQ_ROUTING */
+	case KVM_IRQFD: {
+		struct kvm_irqfd data;
+
+		r = -EFAULT;
+		if (copy_from_user(&data, argp, sizeof data))
+			goto out;
+		r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}