
[RFC,15/17] kvm: add dynamic IRQ support

Message ID 20090331184405.28333.59205.stgit@dev.haskins.net (mailing list archive)
State Not Applicable

Commit Message

Gregory Haskins March 31, 2009, 6:44 p.m. UTC
This patch provides the ability to dynamically declare and map an
interrupt-request handle to an x86 8-bit vector.

Problem Statement: Emulated devices (such as PCI, ISA, etc) have
interrupt routing done via standard PC mechanisms (MP-table, ACPI,
etc).  However, we also want to support a new class of devices
which exist in a new virtualized namespace and therefore should
not try to piggyback on these emulated mechanisms.  Rather, we
create a way to dynamically register interrupt resources that
act independently of their emulated counterparts.

On x86, a simplistic view of the interrupt model is that each core
has a local-APIC which can receive messages from APIC-compliant
routing devices (such as IO-APIC and MSI) regarding details about
an interrupt (such as which vector to raise).  These routing devices
are controlled by the OS so they may translate a physical event
(such as "e1000: raise an RX interrupt") to a logical destination
(such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
implementation of such a router (think of it as a virtual-MSI, but
without the coupling to an existing standard, such as PCI).

The model is simple: A guest OS can allocate the mapping of "IRQ"
handle to "vector/core" in any way it sees fit, and provide this
information to the dynirq module running in the host.  The assigned
IRQ then becomes the sole handle needed to inject an IDT vector
to the guest from a host.  A host entity that wishes to raise an
interrupt simply needs to call kvm_inject_dynirq(irq) and the routing
is performed transparently.
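
To make the flow concrete, here is a rough sketch (illustrative only, not
part of this patch) of how the pieces added below would fit together: the
guest binds a dynirq to a vcpu with create_kvm_dynirq(), and a host-side
device model later raises it with kvm_inject_dynirq().  The driver/device
names and the ISR are hypothetical.

	/* guest side (hypothetical driver setup) */
	#include <linux/interrupt.h>
	#include <linux/kvm_guest.h>

	static irqreturn_t my_eventq_isr(int irq, void *dev_id)
	{
		/* consume pending events from the shared-memory ring */
		return IRQ_HANDLED;
	}

	static int my_driver_init(void)
	{
		/* allocate an IRQ and map it to a vector targeting vcpu 0 */
		int irq = create_kvm_dynirq(0);

		if (irq < 0)
			return irq;

		return request_irq(irq, my_eventq_isr, 0, "my-eventq", NULL);
	}

	/* host side (hypothetical device model) */
	static void my_device_signal_guest(struct kvm *kvm, int irq)
	{
		/* vector/vcpu routing is resolved from the dynirq map */
		kvm_inject_dynirq(kvm, irq);
	}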

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/Kconfig                |    5 +
 arch/x86/Makefile               |    3 
 arch/x86/include/asm/kvm_host.h |    9 +
 arch/x86/include/asm/kvm_para.h |   11 +
 arch/x86/kvm/Makefile           |    3 
 arch/x86/kvm/dynirq.c           |  329 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/guest/Makefile     |    2 
 arch/x86/kvm/guest/dynirq.c     |   95 +++++++++++
 arch/x86/kvm/x86.c              |    6 +
 include/linux/kvm.h             |    1 
 include/linux/kvm_guest.h       |    7 +
 include/linux/kvm_host.h        |    1 
 include/linux/kvm_para.h        |    1 
 13 files changed, 472 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/kvm/dynirq.c
 create mode 100644 arch/x86/kvm/guest/Makefile
 create mode 100644 arch/x86/kvm/guest/dynirq.c
 create mode 100644 include/linux/kvm_guest.h



Comments

Avi Kivity March 31, 2009, 7:20 p.m. UTC | #1
Gregory Haskins wrote:
> This patch provides the ability to dynamically declare and map an
> interrupt-request handle to an x86 8-bit vector.
>
> Problem Statement: Emulated devices (such as PCI, ISA, etc) have
> interrupt routing done via standard PC mechanisms (MP-table, ACPI,
> etc).  However, we also want to support a new class of devices
> which exist in a new virtualized namespace and therefore should
> not try to piggyback on these emulated mechanisms.  Rather, we
> create a way to dynamically register interrupt resources that
> acts indepent of the emulated counterpart.
>
> On x86, a simplistic view of the interrupt model is that each core
> has a local-APIC which can recieve messages from APIC-compliant
> routing devices (such as IO-APIC and MSI) regarding details about
> an interrupt (such as which vector to raise).  These routing devices
> are controlled by the OS so they may translate a physical event
> (such as "e1000: raise an RX interrupt") to a logical destination
> (such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
> implementation of such a router (think of it as a virtual-MSI, but
> without the coupling to an existing standard, such as PCI).
>
> The model is simple: A guest OS can allocate the mapping of "IRQ"
> handle to "vector/core" in any way it sees fit, and provide this
> information to the dynirq module running in the host.  The assigned
> IRQ then becomes the sole handle needed to inject an IDT vector
> to the guest from a host.  A host entity that wishes to raise an
> interrupt simple needs to call kvm_inject_dynirq(irq) and the routing
> is performed transparently.
>   

A major disadvantage of dynirq is that it will only work on guests which 
have been ported to it.  So this will only be useful on newer Linux, and 
will likely never work with Windows guests.

Why is having an emulated PCI device so bad?  We found that it has 
several advantages:
 - works with all guests
 - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
 - supported in all OSes
 - someone else maintains it

See also the kvm irq routing work, merged into 2.6.30, which does a 
small part of what you're describing (the "sole handle" part, specifically).
Gregory Haskins March 31, 2009, 7:39 p.m. UTC | #2
Avi Kivity wrote:
> Gregory Haskins wrote:
>> This patch provides the ability to dynamically declare and map an
>> interrupt-request handle to an x86 8-bit vector.
>>
>> Problem Statement: Emulated devices (such as PCI, ISA, etc) have
>> interrupt routing done via standard PC mechanisms (MP-table, ACPI,
>> etc).  However, we also want to support a new class of devices
>> which exist in a new virtualized namespace and therefore should
>> not try to piggyback on these emulated mechanisms.  Rather, we
>> create a way to dynamically register interrupt resources that
>> acts indepent of the emulated counterpart.
>>
>> On x86, a simplistic view of the interrupt model is that each core
>> has a local-APIC which can recieve messages from APIC-compliant
>> routing devices (such as IO-APIC and MSI) regarding details about
>> an interrupt (such as which vector to raise).  These routing devices
>> are controlled by the OS so they may translate a physical event
>> (such as "e1000: raise an RX interrupt") to a logical destination
>> (such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
>> implementation of such a router (think of it as a virtual-MSI, but
>> without the coupling to an existing standard, such as PCI).
>>
>> The model is simple: A guest OS can allocate the mapping of "IRQ"
>> handle to "vector/core" in any way it sees fit, and provide this
>> information to the dynirq module running in the host.  The assigned
>> IRQ then becomes the sole handle needed to inject an IDT vector
>> to the guest from a host.  A host entity that wishes to raise an
>> interrupt simple needs to call kvm_inject_dynirq(irq) and the routing
>> is performed transparently.
>>   
>
> A major disadvantage of dynirq is that it will only work on guests
> which have been ported to it.  So this will only be useful on newer
> Linux, and will likely never work with Windows guests.
>
> Why is having an emulated PCI device so bad?  We found that it has
> several advantages:
> - works with all guests
> - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
> - supported in all OSes
> - someone else maintains it
These points are all valid, and I really struggled with this particular
part of the design.  The entire vbus design only requires one IRQ for
the entire guest, so it's conceivable that I could present a simple
"dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
the IRQ routing logic.  Then userspace could simply pass the IRQ routing
info down to the kernel with an ioctl, or something similar.

Ultimately I wasn't sure whether I wanted all that goo just to get an
IRQ assignment...but on the other hand, we have all this goo to build
one in the first place, and it's half on the guest side, which has the
disadvantages you mention.  So perhaps this should go in favor of a
PCI-esque type solution, as I think you are suggesting.

I think ultimately I was trying to stay away from PCI in general because
I want to support environments that do not have PCI.  However, for the
kvm-transport case (at least on x86) this isn't really a constraint.

>
> See also the kvm irq routing work, merged into 2.6.30, which does a
> small part of what you're describing (the "sole handle" part,
> specifically).

I will take a look, thanks!

(I wish you had accepted those irq patches I wrote a while back.
They had the foundation for this type of stuff all built in.  But alas, I
think it was before its time, and I didn't do a good job of explaining
my future plans....) ;)

Regards,
-Greg
Avi Kivity March 31, 2009, 8:13 p.m. UTC | #3
Gregory Haskins wrote:
>> - works with all guests
>> - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
>> - supported in all OSes
>> - someone else maintains it
>>     
> These points are all valid, and I really struggled with this particular
> part of the design.  The entire vbus design only requires one IRQ for
> the entire guest,

Won't this have scaling issues?  One IRQ means one target vcpu.  Whereas 
I'd like virtio devices to span multiple queues, each queue with its own 
MSI IRQ.  Also, the single IRQ handler will need to scan for all 
potential IRQ sources.  Even if implemented carefully, this will cause 
many cacheline bounces.

>  so its conceivable that I could present a simple
> "dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
> the IRQ routing logic.  Then userspace could simply pass the IRQ routing
> info down to the kernel with an ioctl, or something similar.
>   

Xen does something similar, I believe.

> I think ultimately I was trying to stay away from PCI in general because
> I want to support environments that do not have PCI.  However, for the
> kvm-transport case (at least on x86) this isnt really a constraint.
>
>   

s/PCI/the native IRQ solution for your platform/. virtio has the same 
problem; on s390 we use the native (if that word ever applies to s390) 
interrupt and device discovery mechanism.
Gregory Haskins March 31, 2009, 8:32 p.m. UTC | #4
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> - works with all guests
>>> - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
>>> - supported in all OSes
>>> - someone else maintains it
>>>     
>> These points are all valid, and I really struggled with this particular
>> part of the design.  The entire vbus design only requires one IRQ for
>> the entire guest,
>
> Won't this have scaling issues?  One IRQ means one target vcpu. 
> Whereas I'd like virtio devices to span multiple queues, each queue
> with its own MSI IRQ.
Hmm... you know, I hadn't really thought of it that way, but you have a
point.  To clarify, my design actually uses one IRQ per "eventq", where
we can have an arbitrary number of eventq's defined (note: today I only
define one eventq, however).  An eventq is actually a shm-ring construct
where I can pass events up to the host like "device added" or "ring X
signaled".  Each individual device based virtio-ring would then
aggregates "signal" events onto this eventq mechanism to actually inject
events to the host.  Only the eventq itself injects an actual IRQ to the
assigned vcpu.

My intended use of multiple eventqs was for prioritization of different
rings.  For instance, we could define 8 priority levels, each with its
own ring/irq.  That way, a virtio-net that supports something like
802.1p could define 8 virtio-rings, one for each priority level.

But this scheme is more targeted at prioritization than per-vcpu
irq-balancing.  I suppose the eventq construct I proposed could still be
used in this fashion since each has its own routable IRQ.  However, I
would have to think about that some more because it is beyond the design
spec.

The good news is that the decision to use the "eventq+irq" approach is
completely contained in the kvm-host+guest.patch.  We could easily
switch to a 1:1 irq:shm-signal if we wanted to, and the device/drivers
would work exactly the same without modification.

>   Also, the single IRQ handler will need to scan for all potential IRQ
> sources.  Even if implemented carefully, this will cause many
> cacheline bounces.
Well, no, I think this part is covered.  As mentioned above, we use a
queuing technique so there is no scanning needed.  Ultimately I would
love to adapt a similar technique to optionally replace the LAPIC.  That
way we can avoid the EOI trap and just consume the next interrupt (if
applicable) from the shm-ring.
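
As a purely illustrative sketch (none of this is in the patch set; the
layout, field names, and dispatch_event() are hypothetical), the eventq
described above could look something like a simple producer/consumer ring,
drained from the single dynirq handler on the side that receives the
injected IRQ, so no scanning of devices is required:

	struct eventq_entry {
		__u32 type;	/* e.g. "device added", "ring X signaled" */
		__u32 id;	/* device or ring identifier */
	};

	struct eventq_ring {
		__u32 head;	/* producer index */
		__u32 tail;	/* consumer index */
		struct eventq_entry entries[256];
	};

	/* run from the dynirq handler; memory barriers omitted for brevity */
	static void eventq_drain(struct eventq_ring *ring)
	{
		while (ring->tail != ring->head) {
			struct eventq_entry *e =
				&ring->entries[ring->tail % 256];

			dispatch_event(e->type, e->id);
			ring->tail++;
		}
	}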

>
>>  so its conceivable that I could present a simple
>> "dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
>> the IRQ routing logic.  Then userspace could simply pass the IRQ routing
>> info down to the kernel with an ioctl, or something similar.
>>   
>
> Xen does something similar, I believe.
>
>> I think ultimately I was trying to stay away from PCI in general because
>> I want to support environments that do not have PCI.  However, for the
>> kvm-transport case (at least on x86) this isnt really a constraint.
>>
>>   
>
> s/PCI/the native IRQ solution for your platform/. virtio has the same
> problem; on s390 we use the native (if that word ever applies to s390)
> interrupt and device discovery mechanism.

yeah, I agree.  We can contain the "exposure" of PCI to just platforms
within KVM that care about it.

-Greg
Avi Kivity March 31, 2009, 8:59 p.m. UTC | #5
Gregory Haskins wrote:
>> Won't this have scaling issues?  One IRQ means one target vcpu. 
>> Whereas I'd like virtio devices to span multiple queues, each queue
>> with its own MSI IRQ.
>>     
> Hmm..you know I hadnt really thought of it that way, but you have a
> point.  To clarify, my design actually uses one IRQ per "eventq", where
> we can have an arbitrary number of eventq's defined (note: today I only
> define one eventq, however).  An eventq is actually a shm-ring construct
> where I can pass events up to the host like "device added" or "ring X
> signaled".  Each individual device based virtio-ring would then
> aggregates "signal" events onto this eventq mechanism to actually inject
> events to the host.  Only the eventq itself injects an actual IRQ to the
> assigned vcpu.
>   

You will get cachelines bounced around when events from different 
devices are added to the queue.  On the plus side, a single injection 
can contain interrupts for multiple devices.

I'm not sure how useful this coalescing is; certainly you will never see 
it on microbenchmarks, but that doesn't mean it's not useful.

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3fca247..91fefd5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -446,6 +446,11 @@  config KVM_GUEST
 	 This option enables various optimizations for running under the KVM
 	 hypervisor.
 
+config KVM_GUEST_DYNIRQ
+       bool "KVM Dynamic IRQ support"
+       depends on KVM_GUEST
+       default y
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index d1a47ad..d788815 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -147,6 +147,9 @@  core-$(CONFIG_XEN) += arch/x86/xen/
 # lguest paravirtualization support
 core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/
 
+# kvm paravirtualization support
+core-$(CONFIG_KVM_GUEST) += arch/x86/kvm/guest/
+
 core-y += arch/x86/kernel/
 core-y += arch/x86/mm/
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 730843d..9ae398a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -346,6 +346,12 @@  struct kvm_mem_alias {
 	gfn_t target_gfn;
 };
 
+struct kvm_dynirq {
+	spinlock_t lock;
+	struct rb_root map;
+	struct kvm *kvm;
+};
+
 struct kvm_arch{
 	int naliases;
 	struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -363,6 +369,7 @@  struct kvm_arch{
 	struct iommu_domain *iommu_domain;
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
+	struct kvm_dynirq *dynirq;
 	struct kvm_pit *vpit;
 	struct hlist_head irq_ack_notifier_list;
 	int vapics_in_nmi_mode;
@@ -519,6 +526,8 @@  int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			  const void *val, int bytes);
 int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
 		  gpa_t addr, unsigned long *ret);
+int kvm_dynirq_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len);
+void kvm_free_dynirq(struct kvm *kvm);
 
 extern bool tdp_enabled;
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index b8a3305..fba210e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -13,6 +13,7 @@ 
 #define KVM_FEATURE_CLOCKSOURCE		0
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
+#define KVM_FEATURE_DYNIRQ		3
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
@@ -45,6 +46,16 @@  struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+/* Operations for KVM_HC_DYNIRQ */
+#define KVM_DYNIRQ_OP_SET   1
+#define KVM_DYNIRQ_OP_CLEAR 2
+
+struct kvm_dynirq_set {
+	__u32 irq;
+	__u32 vec;  /* x86 IDT vector */
+	__u32 dest; /* 0-based vcpu id */
+};
+
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d3ec292..d5676f5 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -14,9 +14,10 @@  endif
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
-	i8254.o
+	i8254.o dynirq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
 kvm-amd-objs = svm.o
 obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+
diff --git a/arch/x86/kvm/dynirq.c b/arch/x86/kvm/dynirq.c
new file mode 100644
index 0000000..54162dd
--- /dev/null
+++ b/arch/x86/kvm/dynirq.c
@@ -0,0 +1,329 @@ 
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Dynamic-Interrupt-Request (dynirq): This module provides the ability
+ * to dynamically declare and map an interrupt-request handle to an
+ * x86 8-bit vector.
+ *
+ * Problem Statement: Emulated devices (such as PCI, ISA, etc) have
+ * interrupt routing done via standard PC mechanisms (MP-table, ACPI,
+ * etc).  However, we also want to support a new class of devices
+ * which exist in a new virtualized namespace and therefore should
+ * not try to piggyback on these emulated mechanisms.  Rather, we
+ * create a way to dynamically register interrupt resources that
+ * act independently of their emulated counterparts.
+ *
+ * On x86, a simplistic view of the interrupt model is that each core
+ * has a local-APIC which can receive messages from APIC-compliant
+ * routing devices (such as IO-APIC and MSI) regarding details about
+ * an interrupt (such as which vector to raise).  These routing devices
+ * are controlled by the OS so they may translate a physical event
+ * (such as "e1000: raise an RX interrupt") to a logical destination
+ * (such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
+ * implementation of such a router (think of it as a virtual-MSI, but
+ * without the coupling to an existing standard, such as PCI).
+ *
+ * The model is simple: A guest OS can allocate the mapping of "IRQ"
+ * handle to "vector/core" in any way it sees fit, and provide this
+ * information to the dynirq module running in the host.  The assigned
+ * IRQ then becomes the sole handle needed to inject an IDT vector
+ * to the guest from a host.  A host entity that wishes to raise an
+ * interrupt simply needs to call kvm_inject_dynirq(irq) and the routing
+ * is performed transparently.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include <linux/module.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_para.h>
+#include <linux/workqueue.h>
+#include <linux/hardirq.h>
+
+#include "lapic.h"
+
+struct dynirq {
+	struct kvm_dynirq *parent;
+	unsigned int       irq;
+	unsigned short     vec;
+	unsigned int       dest;
+	struct rb_node     node;
+	struct work_struct work;
+};
+
+static inline struct dynirq *
+to_dynirq(struct rb_node *node)
+{
+	return node ? container_of(node, struct dynirq, node) : NULL;
+}
+
+static int
+map_add(struct rb_root *root, struct dynirq *entry)
+{
+	int		ret = 0;
+	struct rb_node **new, *parent = NULL;
+	struct rb_node *node = &entry->node;
+
+	new  = &(root->rb_node);
+
+	/* Figure out where to put new node */
+	while (*new) {
+		int val;
+
+		parent = *new;
+
+		val = to_dynirq(node)->irq - to_dynirq(*new)->irq;
+		if (val < 0)
+			new = &((*new)->rb_left);
+		else if (val > 0)
+			new = &((*new)->rb_right);
+		else {
+			ret = -EEXIST;
+			break;
+		}
+	}
+
+	if (!ret) {
+		/* Add new node and rebalance tree. */
+		rb_link_node(node, parent, new);
+		rb_insert_color(node, root);
+	}
+
+	return ret;
+}
+
+static struct dynirq *
+map_find(struct rb_root *root, unsigned int key)
+{
+	struct rb_node *node;
+
+	node = root->rb_node;
+
+	while (node) {
+		int val;
+
+		val = key - to_dynirq(node)->irq;
+		if (val < 0)
+			node = node->rb_left;
+		else if (val > 0)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	return to_dynirq(node);
+}
+
+static void
+dynirq_add(struct kvm_dynirq *dynirq, struct dynirq *entry)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&dynirq->lock, flags);
+	ret = map_add(&dynirq->map, entry);
+	spin_unlock_irqrestore(&dynirq->lock, flags);
+}
+
+static struct dynirq *
+dynirq_find(struct kvm_dynirq *dynirq, int irq)
+{
+	struct dynirq *entry;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dynirq->lock, flags);
+	entry = map_find(&dynirq->map, irq);
+	spin_unlock_irqrestore(&dynirq->lock, flags);
+
+	return entry;
+}
+
+static int
+_kvm_inject_dynirq(struct kvm *kvm, struct dynirq *entry)
+{
+	struct kvm_vcpu *vcpu;
+	int ret;
+
+	mutex_lock(&kvm->lock);
+
+	vcpu = kvm->vcpus[entry->dest];
+	if (!vcpu) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = kvm_apic_set_irq(vcpu, entry->vec, 1);
+
+out:
+	mutex_unlock(&kvm->lock);
+
+	return ret;
+}
+
+static void
+deferred_inject_dynirq(struct work_struct *work)
+{
+	struct dynirq *entry = container_of(work, struct dynirq, work);
+	struct kvm_dynirq *dynirq = entry->parent;
+	struct kvm *kvm = dynirq->kvm;
+
+	_kvm_inject_dynirq(kvm, entry);
+}
+
+int
+kvm_inject_dynirq(struct kvm *kvm, int irq)
+{
+	struct kvm_dynirq *dynirq = kvm->arch.dynirq;
+	struct dynirq *entry;
+
+	entry = dynirq_find(dynirq, irq);
+	if (!entry)
+		return -EINVAL;
+
+	if (preemptible())
+		return _kvm_inject_dynirq(kvm, entry);
+
+	schedule_work(&entry->work);
+	return 0;
+}
+
+static int
+hc_set(struct kvm_vcpu *vcpu, gpa_t gpa, size_t len)
+{
+	struct kvm_dynirq_set args;
+	struct kvm_dynirq    *dynirq = vcpu->kvm->arch.dynirq;
+	struct dynirq        *entry;
+	int                   ret;
+
+	if (len != sizeof(args))
+		return -EINVAL;
+
+	ret = kvm_read_guest(vcpu->kvm, gpa, &args, len);
+	if (ret < 0)
+		return ret;
+
+	if (args.dest >= KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	entry = dynirq_find(dynirq, args.irq);
+	if (!entry) {
+		entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+		INIT_WORK(&entry->work, deferred_inject_dynirq);
+	} else
+		rb_erase(&entry->node, &dynirq->map);
+
+	entry->irq  = args.irq;
+	entry->vec  = args.vec;
+	entry->dest = args.dest;
+
+	dynirq_add(dynirq, entry);
+
+	return 0;
+}
+
+static int
+hc_clear(struct kvm_vcpu *vcpu, gpa_t gpa, size_t len)
+{
+	struct kvm_dynirq *dynirq = vcpu->kvm->arch.dynirq;
+	struct dynirq *entry;
+	unsigned long flags;
+	u32 irq;
+	int ret;
+
+	if (len != sizeof(irq))
+		return -EINVAL;
+
+	ret = kvm_read_guest(vcpu->kvm, gpa, &irq, len);
+	if (ret < 0)
+		return ret;
+
+	spin_lock_irqsave(&dynirq->lock, flags);
+
+	entry = map_find(&dynirq->map, irq);
+	if (entry)
+		rb_erase(&entry->node, &dynirq->map);
+
+	spin_unlock_irqrestore(&dynirq->lock, flags);
+
+	if (!entry)
+		return -ENOENT;
+
+	kfree(entry);
+	return 0;
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0],
+ * a pointer to the arguments in arg[1], and the argument length in arg[2]
+ */
+int
+kvm_dynirq_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+	int ret = -EINVAL;
+
+	mutex_lock(&vcpu->kvm->lock);
+
+	if (unlikely(!vcpu->kvm->arch.dynirq)) {
+		struct kvm_dynirq *dynirq;
+
+		dynirq = kzalloc(sizeof(*dynirq), GFP_KERNEL);
+		if (!dynirq)
+			return -ENOMEM;
+
+		spin_lock_init(&dynirq->lock);
+		dynirq->map = RB_ROOT;
+		dynirq->kvm = vcpu->kvm;
+		vcpu->kvm->arch.dynirq = dynirq;
+	}
+
+	switch (nr) {
+	case KVM_DYNIRQ_OP_SET:
+		ret = hc_set(vcpu, gpa, len);
+		break;
+	case KVM_DYNIRQ_OP_CLEAR:
+		ret = hc_clear(vcpu, gpa, len);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&vcpu->kvm->lock);
+
+	return ret;
+}
+
+void
+kvm_free_dynirq(struct kvm *kvm)
+{
+	struct kvm_dynirq *dynirq = kvm->arch.dynirq;
+	struct rb_node *node;
+
+	while ((node = rb_first(&dynirq->map))) {
+		struct dynirq *entry = to_dynirq(node);
+
+		rb_erase(node, &dynirq->map);
+		kfree(entry);
+	}
+
+	kfree(dynirq);
+}
diff --git a/arch/x86/kvm/guest/Makefile b/arch/x86/kvm/guest/Makefile
new file mode 100644
index 0000000..de8f824
--- /dev/null
+++ b/arch/x86/kvm/guest/Makefile
@@ -0,0 +1,2 @@ 
+
+obj-$(CONFIG_KVM_GUEST_DYNIRQ) += dynirq.o
\ No newline at end of file
diff --git a/arch/x86/kvm/guest/dynirq.c b/arch/x86/kvm/guest/dynirq.c
new file mode 100644
index 0000000..a5cf55e
--- /dev/null
+++ b/arch/x86/kvm/guest/dynirq.c
@@ -0,0 +1,95 @@ 
+#include <linux/module.h>
+#include <linux/irq.h>
+#include <linux/kvm.h>
+#include <linux/kvm_para.h>
+
+#include <asm/irq.h>
+#include <asm/apic.h>
+
+/*
+ * -----------------------
+ * Dynamic-IRQ support
+ * -----------------------
+ */
+
+static int dynirq_set(int irq, int dest)
+{
+	struct kvm_dynirq_set op = {
+		.irq  = irq,
+		.vec  = irq_to_vector(irq),
+		.dest = dest,
+	};
+
+	return kvm_hypercall3(KVM_HC_DYNIRQ, KVM_DYNIRQ_OP_SET,
+			      __pa(&op), sizeof(op));
+}
+
+static void dynirq_chip_noop(unsigned int irq)
+{
+}
+
+static void dynirq_chip_eoi(unsigned int irq)
+{
+	ack_APIC_irq();
+}
+
+struct irq_chip kvm_irq_chip = {
+	.name		= "KVM-DYNIRQ",
+	.mask		= dynirq_chip_noop,
+	.unmask		= dynirq_chip_noop,
+	.eoi		= dynirq_chip_eoi,
+};
+
+int create_kvm_dynirq(int cpu)
+{
+	const cpumask_t *mask = get_cpu_mask(cpu);
+	int irq;
+	int ret;
+
+	ret = kvm_para_has_feature(KVM_FEATURE_DYNIRQ);
+	if (!ret)
+		return -ENOENT;
+
+	irq = create_irq();
+	if (irq < 0)
+		return -ENOSPC;
+
+#ifdef CONFIG_SMP
+	ret = set_irq_affinity(irq, *mask);
+	if (ret < 0)
+		goto error;
+#endif
+
+	set_irq_chip_and_handler_name(irq,
+				      &kvm_irq_chip,
+				      handle_percpu_irq,
+				      "apiceoi");
+
+	ret = dynirq_set(irq, cpu);
+	if (ret < 0)
+		goto error;
+
+	return irq;
+
+error:
+	destroy_irq(irq);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(create_kvm_dynirq);
+
+int destroy_kvm_dynirq(int irq)
+{
+	__u32 _irq = irq;
+
+	if (kvm_para_has_feature(KVM_FEATURE_DYNIRQ))
+		kvm_hypercall3(KVM_HC_DYNIRQ,
+			       KVM_DYNIRQ_OP_CLEAR,
+			       __pa(&_irq),
+			       sizeof(_irq));
+
+	destroy_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(destroy_kvm_dynirq);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b0a649..e24f0a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -972,6 +972,7 @@  int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_SYNC_MMU:
 	case KVM_CAP_RESET:
+	case KVM_CAP_DYNIRQ:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -2684,6 +2685,9 @@  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_MMU_OP:
 		r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
 		break;
+	case KVM_HC_DYNIRQ:
+		ret = kvm_dynirq_hc(vcpu, a0, a1, a2);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
@@ -4141,6 +4145,8 @@  void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_free_pit(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
+	if (kvm->arch.dynirq)
+		kvm_free_dynirq(kvm);
 	kvm_free_vcpus(kvm);
 	kvm_free_physmem(kvm);
 	if (kvm->arch.apic_access_page)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7ffd8f5..349d273 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -397,6 +397,7 @@  struct kvm_trace_rec {
 #define KVM_CAP_USER_NMI 22
 #endif
 #define KVM_CAP_RESET 23
+#define KVM_CAP_DYNIRQ 24
 
 /*
  * ioctls for VM fds
diff --git a/include/linux/kvm_guest.h b/include/linux/kvm_guest.h
new file mode 100644
index 0000000..7dd7930
--- /dev/null
+++ b/include/linux/kvm_guest.h
@@ -0,0 +1,7 @@ 
+#ifndef __LINUX_KVM_GUEST_H
+#define __LINUX_KVM_GUEST_H
+
+extern int create_kvm_dynirq(int cpu);
+extern int destroy_kvm_dynirq(int irq);
+
+#endif /* __LINUX_KVM_GUEST_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 506eca1..bec9b35 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -297,6 +297,7 @@  int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
+int kvm_inject_dynirq(struct kvm *kvm, int irq);
 
 int kvm_is_mmio_pfn(pfn_t pfn);
 
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 3ddce03..a2de904 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -16,6 +16,7 @@ 
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
+#define KVM_HC_DYNIRQ			3
 
 /*
  * hypercalls use architecture specific