diff mbox

[23/31] nVMX: Correct handling of interrupt injection

Message ID 201105161955.p4GJtgKc001996@rice.haifa.ibm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Nadav Har'El May 16, 2011, 7:55 p.m. UTC
The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running.

Because of this code's relative un-obviousness, I include here a longer-than-
usual justification for what it does - much longer than the code itself ;-)

To understand how to correctly emulate interrupt injection while L2 is
running, let's look first at what we need to emulate: How would things look
like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
an interrupt, we had hardware delivering an interrupt?

Now we have L1 running on bare metal with a guest L2, and the hardware
generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and 
VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
happens now is this: The processor exits from L2 to L1, with an external-
interrupt exit reason but without an interrupt vector. L1 runs, with
interrupts disabled, and it doesn't yet know what the interrupt was. Soon
after, it enables interrupts and only at that moment, it gets the interrupt
from the processor. when L1 is KVM, Linux handles this interrupt.

Now we need exactly the same thing to happen when that L1->L2 system runs
on top of L0, instead of real hardware. This is how we do this:

When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
external-interrupt exit reason (with an invalid interrupt vector), and run L1.
Just like in the bare metal case, it likely can't deliver the interrupt to
L1 now because L1 is running with interrupts disabled, in which case it turns
on the interrupt window when running L1 after the exit. L1 will soon enable
interrupts, and at that point L0 will gain control again and inject the
interrupt to L1.

Finally, there is an extra complication in the code: when nested_run_pending,
we cannot return to L1 now, and must launch L2. We need to remember the
interrupt we wanted to inject (and not clear it now), and do it on the
next exit.

The above explanation shows that the relative strangeness of the nested
interrupt injection code in this patch, and the extra interrupt-window
exit incurred, are in fact necessary for accurate emulation, and are not
just an unoptimized implementation.

Let's revisit now the two assumptions made above:

If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
does, by the way), things are simple: L0 may inject the interrupt directly
to the L2 guest - using the normal code path that injects to any guest.
We support this case in the code below.

If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
does), things look very different from the description above: L1 expects
to see an exit from L2 with the interrupt vector already filled in the exit
information, and does not expect to be interrupted again with this interrupt.
The current code does not (yet) support this case, so we do not allow the
VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tian, Kevin May 25, 2011, 8:39 a.m. UTC | #1
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:56 AM
> 
> The code in this patch correctly emulates external-interrupt injection
> while a nested guest L2 is running.
> 
> Because of this code's relative un-obviousness, I include here a longer-than-
> usual justification for what it does - much longer than the code itself ;-)
> 
> To understand how to correctly emulate interrupt injection while L2 is
> running, let's look first at what we need to emulate: How would things look
> like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
> an interrupt, we had hardware delivering an interrupt?
> 
> Now we have L1 running on bare metal with a guest L2, and the hardware
> generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
> 1, and
> VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
> what
> happens now is this: The processor exits from L2 to L1, with an external-
> interrupt exit reason but without an interrupt vector. L1 runs, with
> interrupts disabled, and it doesn't yet know what the interrupt was. Soon
> after, it enables interrupts and only at that moment, it gets the interrupt
> from the processor. when L1 is KVM, Linux handles this interrupt.
> 
> Now we need exactly the same thing to happen when that L1->L2 system runs
> on top of L0, instead of real hardware. This is how we do this:
> 
> When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
> external-interrupt exit reason (with an invalid interrupt vector), and run L1.
> Just like in the bare metal case, it likely can't deliver the interrupt to
> L1 now because L1 is running with interrupts disabled, in which case it turns
> on the interrupt window when running L1 after the exit. L1 will soon enable
> interrupts, and at that point L0 will gain control again and inject the
> interrupt to L1.
> 
> Finally, there is an extra complication in the code: when nested_run_pending,
> we cannot return to L1 now, and must launch L2. We need to remember the
> interrupt we wanted to inject (and not clear it now), and do it on the
> next exit.
> 
> The above explanation shows that the relative strangeness of the nested
> interrupt injection code in this patch, and the extra interrupt-window
> exit incurred, are in fact necessary for accurate emulation, and are not
> just an unoptimized implementation.
> 
> Let's revisit now the two assumptions made above:
> 
> If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
> does, by the way), things are simple: L0 may inject the interrupt directly
> to the L2 guest - using the normal code path that injects to any guest.
> We support this case in the code below.
> 
> If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> does), things look very different from the description above: L1 expects

Type-1 bare metal hypervisor may enable this bit, such as Xen. This bit is
really prepared for L2 hypervisor since normally L2 hypervisor is tricky to
touch generic interrupt logic, and thus better to not ack it until interrupt
is enabled and then hardware will gear to the kernel interrupt handler
automatically.

> to see an exit from L2 with the interrupt vector already filled in the exit
> information, and does not expect to be interrupted again with this interrupt.
> The current code does not (yet) support this case, so we do not allow the
> VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Then just fill the interrupt vector field with the highest unmasked bit
from pending vIRR.

> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
> 
>  	/* exit controls */
>  	nested_vmx_exit_ctls_low = 0;
> +	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
> */
>  #ifdef CONFIG_X86_64
>  	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
> @@ -3733,9 +3734,25 @@ out:
>  	return ret;
>  }
> 
> +/*
> + * In nested virtualization, check if L1 asked to exit on external interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +		PIN_BASED_EXT_INTR_MASK;
> +}
> +

could be a similar common wrapper like nested_cpu_has...

Thanks,
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tian, Kevin May 25, 2011, 8:45 a.m. UTC | #2
> From: Tian, Kevin
> Sent: Wednesday, May 25, 2011 4:40 PM
> > If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> > does), things look very different from the description above: L1 expects
> 
> Type-1 bare metal hypervisor may enable this bit, such as Xen. This bit is
> really prepared for L2 hypervisor since normally L2 hypervisor is tricky to
> touch generic interrupt logic, and thus better to not ack it until interrupt
> is enabled and then hardware will gear to the kernel interrupt handler
> automatically.
> 
> > to see an exit from L2 with the interrupt vector already filled in the exit
> > information, and does not expect to be interrupted again with this interrupt.
> > The current code does not (yet) support this case, so we do not allow the
> > VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
> 
> Then just fill the interrupt vector field with the highest unmasked bit
> from pending vIRR.
> 

And also ack virtual irqchip accordingly...

Thanks
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tian, Kevin May 25, 2011, 9:18 a.m. UTC | #3
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:56 AM
> 
> The code in this patch correctly emulates external-interrupt injection
> while a nested guest L2 is running.
> 
> Because of this code's relative un-obviousness, I include here a longer-than-
> usual justification for what it does - much longer than the code itself ;-)
> 
> To understand how to correctly emulate interrupt injection while L2 is
> running, let's look first at what we need to emulate: How would things look
> like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
> an interrupt, we had hardware delivering an interrupt?
> 
> Now we have L1 running on bare metal with a guest L2, and the hardware
> generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
> 1, and
> VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
> what
> happens now is this: The processor exits from L2 to L1, with an external-
> interrupt exit reason but without an interrupt vector. L1 runs, with
> interrupts disabled, and it doesn't yet know what the interrupt was. Soon
> after, it enables interrupts and only at that moment, it gets the interrupt
> from the processor. when L1 is KVM, Linux handles this interrupt.
> 
> Now we need exactly the same thing to happen when that L1->L2 system runs
> on top of L0, instead of real hardware. This is how we do this:
> 
> When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
> external-interrupt exit reason (with an invalid interrupt vector), and run L1.
> Just like in the bare metal case, it likely can't deliver the interrupt to
> L1 now because L1 is running with interrupts disabled, in which case it turns
> on the interrupt window when running L1 after the exit. L1 will soon enable
> interrupts, and at that point L0 will gain control again and inject the
> interrupt to L1.
> 
> Finally, there is an extra complication in the code: when nested_run_pending,
> we cannot return to L1 now, and must launch L2. We need to remember the
> interrupt we wanted to inject (and not clear it now), and do it on the
> next exit.
> 
> The above explanation shows that the relative strangeness of the nested
> interrupt injection code in this patch, and the extra interrupt-window
> exit incurred, are in fact necessary for accurate emulation, and are not
> just an unoptimized implementation.
> 
> Let's revisit now the two assumptions made above:
> 
> If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
> does, by the way), things are simple: L0 may inject the interrupt directly
> to the L2 guest - using the normal code path that injects to any guest.
> We support this case in the code below.
> 
> If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> does), things look very different from the description above: L1 expects
> to see an exit from L2 with the interrupt vector already filled in the exit
> information, and does not expect to be interrupted again with this interrupt.
> The current code does not (yet) support this case, so we do not allow the
> VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
> 
>  	/* exit controls */
>  	nested_vmx_exit_ctls_low = 0;
> +	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
> */
>  #ifdef CONFIG_X86_64
>  	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
> @@ -3733,9 +3734,25 @@ out:
>  	return ret;
>  }
> 
> +/*
> + * In nested virtualization, check if L1 asked to exit on external interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +		PIN_BASED_EXT_INTR_MASK;
> +}
> +
>  static void enable_irq_window(struct kvm_vcpu *vcpu)
>  {
>  	u32 cpu_based_vm_exec_control;
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> +		/* We can get here when nested_run_pending caused
> +		 * vmx_interrupt_allowed() to return false. In this case, do
> +		 * nothing - the interrupt will be injected later.
> +		 */

I think this is not a rare path? when vcpu is in guest mode with L2 as current
vmx context, this function could be invoked multiple times since kvm thread
can be scheduled out/in randomly. 

> +		return;
> 
>  	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> @@ -3858,6 +3875,17 @@ static void vmx_set_nmi_mask(struct kvm_
> 
>  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>  {
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
> +		struct vmcs12 *vmcs12;
> +		if (to_vmx(vcpu)->nested.nested_run_pending)
> +			return 0;

Well, now I can see why you require this special 'nested_run_pending' flag
because there're places where L0 injects virtual interrupts right after
VMLAUNCH/VMRESUME emulation and before entering L2. :-)

> +		nested_vmx_vmexit(vcpu);
> +		vmcs12 = get_vmcs12(vcpu);
> +		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> +		vmcs12->vm_exit_intr_info = 0;
> +		/* fall through to normal code, but now in L1, not L2 */
> +	}
> +

This is a bad place to add this logic. vmx_interrupt_allowed is simply a
query function but you make it an absolute trigger point for switching from
L2 to L1. This is fine as now only point calling vmx_interrupt_allowed is
when there's vNMI pending. But it's dangerous to have such assumption
for pending events inside vmx_interrupt_allowed.

On the other hand, I think there's one area which is not handled timely.
I think you need to kick a L2->L1 transition when L0 wants to inject 
virtual interrupt. Consider your current logic:

a) L2 is running on cpu1
b) L0 on cpu 0 decides to post a virtual interrupt to L1. An IPI is issued to 
cpu1 after updating virqchip
c) L2 on cpu0 vmexit to L0, and checks whether L0 or L1 should handle
the event. As it's an external interrupt, L0 will handle it. As it's a notification
IPI, nothing is required.
d) L0 on cpu0 then decides to resume, and find KVM_REQ_EVENT

At this point you only add logic to enable_irq_window, but there's no 
action to trigger L2->L1 transition. So what will happen? Will the event
be injected into L2 instead or pend until next switch happens due to
other cause?

Thanks
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nadav Har'El May 25, 2011, 10:56 a.m. UTC | #4
On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection":
> > If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> > does), things look very different from the description above: L1 expects
> 
> Type-1 bare metal hypervisor may enable this bit, such as Xen. This bit is
> really prepared for L2 hypervisor since normally L2 hypervisor is tricky to
> touch generic interrupt logic, and thus better to not ack it until interrupt
> is enabled and then hardware will gear to the kernel interrupt handler
> automatically.

I have to be honest (and I was, in the patch set's introduction), this version
of nested VMX was only tested with a KVM L1. We've had VMWARE Server running
as a guest using a two-year-old branch of this code (as reported in our paper),
but this code has changed considerably since and it probably will not work
today. We've never tested Xen as L1.

I'll remove the emphasis on "no hypervisor that I know does", but the important
point remains: VM_EXIT_ACK_INTR_ON_EXIT is an optional feature, which we
report is *not* supported, so L1 should not attempt to use it.
As you said, it's not that difficult to support it, so we should, eventually,
but it's not a priority right now. It's on my Bugzilla.

> > +/*
> > + * In nested virtualization, check if L1 asked to exit on external interrupts.
> > + * For most existing hypervisors, this will always return true.
> > + */
> > +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> > +{
> > +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> > +		PIN_BASED_EXT_INTR_MASK;
> > +}
> > +
> 
> could be a similar common wrapper like nested_cpu_has...

I thought this made the callers easier to read, and didn't result in too much
code duplication, so I prefer it this way.
Nadav Har'El May 25, 2011, 12:33 p.m. UTC | #5
On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection":
> >  static void enable_irq_window(struct kvm_vcpu *vcpu)
> >  {
> >  	u32 cpu_based_vm_exec_control;
> > +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> > +		/* We can get here when nested_run_pending caused
> > +		 * vmx_interrupt_allowed() to return false. In this case, do
> > +		 * nothing - the interrupt will be injected later.
> > +		 */
> 
> I think this is not a rare path? when vcpu is in guest mode with L2 as current
> vmx context, this function could be invoked multiple times since kvm thread
> can be scheduled out/in randomly. 

As I wrote in this comment, this can only happen on nested_run_pending
(i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
exited L2, and we wouldn't be in this case.

I don't know if to classify this as a "rare" path - it's definitely not
very common. But what does it matter if it's rare or common?


> > +		if (to_vmx(vcpu)->nested.nested_run_pending)
> > +			return 0;
> 
> Well, now I can see why you require this special 'nested_run_pending' flag
> because there're places where L0 injects virtual interrupts right after
> VMLAUNCH/VMRESUME emulation and before entering L2. :-)

Indeed. I tried to explain that in the patch description, where I wrote

 We keep a new flag, "nested_run_pending", which can override the decision of
 which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
 L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
 and therefore expects L2 to be run (and perhaps be injected with an event it
 specified, etc.). Nested_run_pending is especially intended to avoid switching
 to L1 in the injection decision-point described above.

> > +		nested_vmx_vmexit(vcpu);
> > +		vmcs12 = get_vmcs12(vcpu);
> > +		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> > +		vmcs12->vm_exit_intr_info = 0;
> > +		/* fall through to normal code, but now in L1, not L2 */
> > +	}
> > +
> 
> This is a bad place to add this logic. vmx_interrupt_allowed is simply a
> query function but you make it an absolute trigger point for switching from
> L2 to L1. This is fine as now only point calling vmx_interrupt_allowed is
> when there's vNMI pending. But it's dangerous to have such assumption
> for pending events inside vmx_interrupt_allowed.

Now you're beating a dead horse ;-)

Gleb, and to some degree Avi, already argued that this is the wrong place
to do this exit, and if anything the exit should be done (or just decided on)
in enable_irq_window().

My counter-argument was that the right way is *neither* of these approaches -
any attempt to "commandeer" one of the existing x86 ops, be they
vmx_interrupt_allowed() or enable_irq_window() to do in the L2 case things
they were never designed to do is both ugly, and dangerous if the call sites
change at some time in the future.

So rather than changing one ugly abuse of one function, to the (arguably
also ugly) abuse of another function, what I'd like to see is a better overall
design, where the call sites in x86.c know about the possibility of a nested
guest (they already do - like we previously discussed, an is_guest_mode()
function was recently added), and when they need, *they* will call an
exit-to-L1 function, rather than calling a function called "enable_irq_window"
or "vmx_interrupt_allowed" which mysteriously will do the exit.


> On the other hand, I think there's one area which is not handled timely.
> I think you need to kick a L2->L1 transition when L0 wants to inject 
> virtual interrupt. Consider your current logic:
> 
> a) L2 is running on cpu1
> b) L0 on cpu 0 decides to post a virtual interrupt to L1. An IPI is issued to 
> cpu1 after updating virqchip
> c) L2 on cpu0 vmexit to L0, and checks whether L0 or L1 should handle
> the event. As it's an external interrupt, L0 will handle it. As it's a notification
> IPI, nothing is required.
> d) L0 on cpu0 then decides to resume, and find KVM_REQ_EVENT
> 
> At this point you only add logic to enable_irq_window, but there's no 
> action to trigger L2->L1 transition. So what will happen? Will the event
> be injected into L2 instead or pend until next switch happens due to
> other cause?

I'm afraid I'm missing something in your explanation... In step d, L0 finds
an interrupt in the injection queue, so isn't the first thing it does is to
call vmx_interrupt_allowed(), to check if injection is allowed now?
In our code, "vmx_interrupt_allowed()" was bastardized to exit to L1 in
this case. Isn't that the missing exit you were looking for?
Tian, Kevin May 25, 2011, 12:55 p.m. UTC | #6
> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 8:34 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX:
> Correct handling of interrupt injection":
> > >  static void enable_irq_window(struct kvm_vcpu *vcpu)
> > >  {
> > >  	u32 cpu_based_vm_exec_control;
> > > +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> > > +		/* We can get here when nested_run_pending caused
> > > +		 * vmx_interrupt_allowed() to return false. In this case, do
> > > +		 * nothing - the interrupt will be injected later.
> > > +		 */
> >
> > I think this is not a rare path? when vcpu is in guest mode with L2 as current
> > vmx context, this function could be invoked multiple times since kvm thread
> > can be scheduled out/in randomly.
> 
> As I wrote in this comment, this can only happen on nested_run_pending
> (i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
> and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
> exited L2, and we wouldn't be in this case.
> 
> I don't know if to classify this as a "rare" path - it's definitely not
> very common. But what does it matter if it's rare or common?

It doesn't matter much. I just tried to understand your comment.

> 
> 
> > > +		if (to_vmx(vcpu)->nested.nested_run_pending)
> > > +			return 0;
> >
> > Well, now I can see why you require this special 'nested_run_pending' flag
> > because there're places where L0 injects virtual interrupts right after
> > VMLAUNCH/VMRESUME emulation and before entering L2. :-)
> 
> Indeed. I tried to explain that in the patch description, where I wrote
> 
>  We keep a new flag, "nested_run_pending", which can override the decision
> of
>  which should run next, L1 or L2. nested_run_pending=1 means that we
> *must* run
>  L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
>  and therefore expects L2 to be run (and perhaps be injected with an event it
>  specified, etc.). Nested_run_pending is especially intended to avoid switching
>  to L1 in the injection decision-point described above.

atm when nested_run_pending is first introduced, its usage is simple which made
me think this field may not be required. But later several key patches do depend
on this flag for correctness. :-)

> 
> > > +		nested_vmx_vmexit(vcpu);
> > > +		vmcs12 = get_vmcs12(vcpu);
> > > +		vmcs12->vm_exit_reason =
> EXIT_REASON_EXTERNAL_INTERRUPT;
> > > +		vmcs12->vm_exit_intr_info = 0;
> > > +		/* fall through to normal code, but now in L1, not L2 */
> > > +	}
> > > +
> >
> > This is a bad place to add this logic. vmx_interrupt_allowed is simply a
> > query function but you make it an absolute trigger point for switching from
> > L2 to L1. This is fine as now only point calling vmx_interrupt_allowed is
> > when there's vNMI pending. But it's dangerous to have such assumption
> > for pending events inside vmx_interrupt_allowed.
> 
> Now you're beating a dead horse ;-)
> 
> Gleb, and to some degree Avi, already argued that this is the wrong place
> to do this exit, and if anything the exit should be done (or just decided on)
> in enable_irq_window().
> 
> My counter-argument was that the right way is *neither* of these approaches
> -
> any attempt to "commandeer" one of the existing x86 ops, be they
> vmx_interrupt_allowed() or enable_irq_window() to do in the L2 case things
> they were never designed to do is both ugly, and dangerous if the call sites
> change at some time in the future.
> 
> So rather than changing one ugly abuse of one function, to the (arguably
> also ugly) abuse of another function, what I'd like to see is a better overall
> design, where the call sites in x86.c know about the possibility of a nested
> guest (they already do - like we previously discussed, an is_guest_mode()
> function was recently added), and when they need, *they* will call an
> exit-to-L1 function, rather than calling a function called "enable_irq_window"
> or "vmx_interrupt_allowed" which mysteriously will do the exit.
> 

I agree with your point here.

> 
> > On the other hand, I think there's one area which is not handled timely.
> > I think you need to kick a L2->L1 transition when L0 wants to inject
> > virtual interrupt. Consider your current logic:
> >
> > a) L2 is running on cpu1
> > b) L0 on cpu 0 decides to post a virtual interrupt to L1. An IPI is issued to
> > cpu1 after updating virqchip
> > c) L2 on cpu0 vmexit to L0, and checks whether L0 or L1 should handle
> > the event. As it's an external interrupt, L0 will handle it. As it's a notification
> > IPI, nothing is required.
> > d) L0 on cpu0 then decides to resume, and find KVM_REQ_EVENT
> >
> > At this point you only add logic to enable_irq_window, but there's no
> > action to trigger L2->L1 transition. So what will happen? Will the event
> > be injected into L2 instead or pend until next switch happens due to
> > other cause?
> 
> I'm afraid I'm missing something in your explanation... In step d, L0 finds
> an interrupt in the injection queue, so isn't the first thing it does is to
> call vmx_interrupt_allowed(), to check if injection is allowed now?
> In our code, "vmx_interrupt_allowed()" was bastardized to exit to L1 in
> this case. Isn't that the missing exit you were looking for?
> 

This is a false alarm. In my earlier search I thought that vmx_interrupt_allowed
is only invoked in vmx.c for pending vNMI check which actually led me wonder
for a bigger problem. But actually this .interrupt_allowed is checked in common
path as expected. So my own problem here. :-)

Thanks
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -1788,6 +1788,7 @@  static __init void nested_vmx_setup_ctls
 
 	/* exit controls */
 	nested_vmx_exit_ctls_low = 0;
+	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported. */
 #ifdef CONFIG_X86_64
 	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #else
@@ -3733,9 +3734,25 @@  out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3858,6 +3875,17 @@  static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		struct vmcs12 *vmcs12;
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu);
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
+		vmcs12->vm_exit_intr_info = 0;
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5545,6 +5573,14 @@  static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;