
event injection MACROs

Message ID 9832F13BD22FB94A829F798DA4A8280501B2762226@pdsmsx503.ccr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dong, Eddie May 14, 2009, 1:43 p.m. UTC
Avi Kivity wrote:
> Dong, Eddie wrote:
>> OK.
>> Also back to Gleb's question, the reason I want to do that is to
>> simplify event 
>> generation mechanism in current KVM.
>> 
>> Today KVM use additional layer of exception/nmi/interrupt such as
>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>> vcpu->arch.nmi_injected. 
>> All those additional layer is due to compete of
>> VM_ENTRY_INTR_INFO_FIELD 
>> write to inject the event. Both SVM & VMX has only one resource to
>> inject the virtual event but KVM generates 3 catagory of events in
>> parallel which further requires additional 
>> logic to dictate among them.
> 
> I thought of using a queue to hold all pending events (in a common
> format), sort it by priority, and inject the head.

The SDM Table 5-4 requires merging two events, i.e. converting them to a
#DF/triple fault, or injecting them serially, when two events occur
together, whether NMI, IRQ, or exception.
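
For concreteness, the Table 5-4 merge rule can be sketched as below (the
classes follow the SDM; this helper is illustrative only, not existing KVM
code):

/* Illustrative sketch of the SDM Table 5-4 merge rule, not existing KVM code. */
enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT };

static enum exc_class event_class(unsigned int nr)
{
	switch (nr) {
	case 14:				/* #PF */
		return EXC_PAGE_FAULT;
	case 0: case 10: case 11: case 12: case 13:
		return EXC_CONTRIBUTORY;	/* #DE, #TS, #NP, #SS, #GP */
	default:
		return EXC_BENIGN;		/* NMI, IRQ and everything else */
	}
}

/* Vector to deliver when 'second' occurs while delivering 'first'. */
static unsigned int merge_events(unsigned int first, unsigned int second)
{
	enum exc_class c1 = event_class(first), c2 = event_class(second);

	if ((c1 == EXC_CONTRIBUTORY && c2 == EXC_CONTRIBUTORY) ||
	    (c1 == EXC_PAGE_FAULT && c2 != EXC_BENIGN))
		return 8;	/* #DF; a fault while delivering #DF is a triple fault */
	return second;		/* benign combination: deliver serially */
}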

If we take this event merging into account, the result is effectively a
single-element queue. We could have either: 1) a pure SW "queue" that is
flushed to the HW register (VM_ENTRY_INTR_INFO_FIELD) later, or 2) direct
use of the HW register.
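
Option 1 could look roughly like the sketch below (hedged: the struct and
helper are made up for illustration; only VM_ENTRY_INTR_INFO_FIELD, the
INTR_INFO_* masks and the vmcs accessors are real VMX definitions):

/*
 * Illustrative single-element SW "queue" kept in the
 * VM_ENTRY_INTR_INFO_FIELD format and flushed to HW before VM entry.
 */
struct pending_event {
	bool	valid;
	u32	intr_info;	/* vector | type | deliver-error-code | valid */
	u32	error_code;
};

static void flush_pending_event(struct pending_event *ev)
{
	if (!ev->valid)
		return;
	if (ev->intr_info & INTR_INFO_DELIVER_CODE_MASK)
		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, ev->error_code);
	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
		     ev->intr_info | INTR_INFO_VALID_MASK);
	ev->valid = false;	/* the single slot is free again */
}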


A potential benefit is that it avoids duplicated code and potential bugs
in the current code, as the following patch shows, if I understand correctly:

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
                cr2 = vmcs_readl(EXIT_QUALIFICATION);
                KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
                            (u32)((u64)cr2 >> 32), handler);
-               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
+               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
                        kvm_mmu_unprotect_page_virt(vcpu, cr2);
                return kvm_mmu_page_fault(vcpu, cr2, error_code);
        }

Either way is OK and it's up to you. BTW, Xen uses the HW register directly
to represent a pending event.

> 
>> One example is that exception has higher priority
>> than NMI/IRQ injection in current code which is not true in reality.
>> 
> 
> I don't think it matters in practice, since the guest will see it as a
> timing issue.  NMIs and IRQs are asynchronous (even those generated by
> the guest through the local APIC).

Yes. But it also causes IRQ injection to be delayed, which may have side
effects, for example if the guest exception handler is very long or if a
guest VCPU falls into a recursive #GP. Within the current logic, a guest IRQ
event from KDB (an IPI) running on VCPU0, as an example, can't force the
dead-looping VCPU1 into KDB, since VCPU1 is recursively taking #GP.

> 
>> Another issue is that an failed event from previous injection say
>> IRQ or NMI may be discarded if an virtual exception happens in the
>> EXIT handling now. With the patch of generic double fault handling,
>> this case should be handled as normally. 
>> 
> 
> Discarding an exception is usually okay as it will be regenerated.  I
> don't think we discard interrupts or NMIs.
In reality (running an OS in the guest), it hasn't happened so far. But
architecturally, it could. For example, KVM injects an IRQ, but VM resume
takes a #PF and comes back to KVM with IDT_VECTORING valid. KVM then puts
the failed IRQ back into the interrupt queue. But if #PF handling generates
another exception, the interrupt queue can't be injected, since KVM injects
the exception first, and the interrupt queue is discarded at the next VM exit.

Overall, I think this is mostly a simplification, but it may benefit us a
lot in the future. Especially with Gleb's recent cleanup, it sounds much
easier to do than before.

I may be making a mistake here and would like to see more comments.
thx, eddie

Comments

Gleb Natapov May 14, 2009, 2:16 p.m. UTC | #1
On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
> Avi Kivity wrote:
> > Dong, Eddie wrote:
> >> OK.
> >> Also back to Gleb's question, the reason I want to do that is to
> >> simplify event 
> >> generation mechanism in current KVM.
> >> 
> >> Today KVM use additional layer of exception/nmi/interrupt such as
> >> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
> >> vcpu->arch.nmi_injected. 
> >> All those additional layer is due to compete of
> >> VM_ENTRY_INTR_INFO_FIELD 
> >> write to inject the event. Both SVM & VMX has only one resource to
> >> inject the virtual event but KVM generates 3 catagory of events in
> >> parallel which further requires additional 
> >> logic to dictate among them.
> > 
> > I thought of using a queue to hold all pending events (in a common
> > format), sort it by priority, and inject the head.
> 
> The SDM Table 5-4 requires to merge 2 events together, i.e. convert to #DF/
> Triple fault or inject serially when 2 events happens no matter NMI, IRQ or exception.
> 
> As if considering above events merging activity, that is a single element queue.
I don't know how you got to this conclusion from your previous statement.
See the explanation of Table 5-2, for instance, where it is stated that an
interrupt should be held pending if there is an exception with higher
priority. Held pending where? In the queue, like we do. Note that
low-priority exceptions are just dropped, since they will be regenerated.

>  We could have either:  1) A pure SW "queue" that will be flush to HW 
> register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW register.
> 
We have three event sources: 1) exceptions, 2) IRQ, 3) NMI. We should have
a queue of three elements sorted by priority. On each entry we should
inject the event with the highest priority, and remove it from the queue
on exit.
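
As a rough sketch (hypothetical types for discussion, not a patch):

/* Hypothetical three-slot event queue, one slot per source, sorted by priority. */
enum kvm_event_type { EV_EXCEPTION, EV_NMI, EV_IRQ, EV_NR };

struct kvm_event {
	bool	pending;
	u8	vector;
};

/* Lower index = higher priority. */
static struct kvm_event *highest_pending(struct kvm_event q[EV_NR])
{
	int i;

	for (i = 0; i < EV_NR; i++)
		if (q[i].pending)
			return &q[i];
	return NULL;	/* nothing to inject on this entry */
}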

> 
> A potential benefit is that it can avoid duplicated code and potential bugs
> in current code as following patch shows if I understand correctly:
> 
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct
> kvm_run *kvm_run)
>                 cr2 = vmcs_readl(EXIT_QUALIFICATION);
>                 KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
>                             (u32)((u64)cr2 >> 32), handler);
> -               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending
> )
> +               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending
>  || vcpu->arch.nmi_injected)
>                         kvm_mmu_unprotect_page_virt(vcpu, cr2);
>                 return kvm_mmu_page_fault(vcpu, cr2, error_code);
>         }
This fix is already in Avi's tree (not yet pushed).

> Either way are OK and up to you. BTW Xen uses HW register directly to representing
> an pending event.
> 
In this particular case I don't mind using the HW register either, but I
don't see any advantage.

> > 
> >> One example is that exception has higher priority
> >> than NMI/IRQ injection in current code which is not true in reality.
> >> 
> > 
> > I don't think it matters in practice, since the guest will see it as a
> > timing issue.  NMIs and IRQs are asynchronous (even those generated by
> > the guest through the local APIC).
> 
> Yes. But also cause IRQ injection be delayed which may have side effect.
> For example if guest exception handler is very longer or if guest VCPU fall into
> recursive #GP. Within current logic, a guest IRQ event from KDB (IPI) running
> on VCPU0, as an example, can't force the dead loop VCPU1 into KDB since it
> is recursively #GP.
If one #GP causes another #GP, that is a #DF. If the CPU has a chance to
execute something in between, KVM will have a chance to inject the NMI.

> 
> > 
> >> Another issue is that an failed event from previous injection say
> >> IRQ or NMI may be discarded if an virtual exception happens in the
> >> EXIT handling now. With the patch of generic double fault handling,
> >> this case should be handled as normally. 
> >> 
> > 
> > Discarding an exception is usually okay as it will be regenerated.  I
> > don't think we discard interrupts or NMIs.
> In reality (Running OS in guest), it doesn't happen so far. But architecturally, 
> it could. For example KVM injects an IRQ, but VM Resume get #PF and 
> back to KVM with IDT_VECTORING valid. Then KVM will put back the failed 
> IRQ to interrupt queue. But if #PF handling generates another exception,
> then the interrupt queue won't be able to be injected, since KVM inject 
> exception first. And the interrupt queue is discarded at next VM Exit.
> 
I acknowledge the presence of the bug, although I have not been able to
write a test case that triggers it yet; it is easy to fix without changing
the code too much. A unified event queue, with only the injected event
cleared on exit, should do the trick.
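
Roughly (a hypothetical helper building on the three-slot queue above):

/* Clear only the event that was actually injected; the rest stay queued. */
static void complete_event_injection(struct kvm_event *injected, bool delivered)
{
	if (injected && delivered)
		injected->pending = false;
}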

--
			Gleb.
Dong, Eddie May 14, 2009, 2:34 p.m. UTC | #2
Gleb Natapov wrote:
> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>> Avi Kivity wrote:
>>> Dong, Eddie wrote:
>>>> OK.
>>>> Also back to Gleb's question, the reason I want to do that is to
>>>> simplify event generation mechanism in current KVM.
>>>> 
>>>> Today KVM use additional layer of exception/nmi/interrupt such as
>>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>> vcpu->arch.nmi_injected. All those additional layer is due to
>>>> compete of VM_ENTRY_INTR_INFO_FIELD
>>>> write to inject the event. Both SVM & VMX has only one resource to
>>>> inject the virtual event but KVM generates 3 catagory of events in
>>>> parallel which further requires additional
>>>> logic to dictate among them.
>>> 
>>> I thought of using a queue to hold all pending events (in a common
>>> format), sort it by priority, and inject the head.
>> 
>> The SDM Table 5-4 requires to merge 2 events together, i.e. convert
>> to #DF/ 
>> Triple fault or inject serially when 2 events happens no matter NMI,
>> IRQ or exception. 
>> 
>> As if considering above events merging activity, that is a single
>> element queue. 
> I don't know how you got to this conclusion from you previous
> statement. 
> See explanation to table 5-2 for instate where it is stated that
> interrupt should be held pending if there is exception with higher
> priority. Should be held pending where? In the queue, like we do. Note
> that low prio exceptions are just dropped since they will be
> regenerated. 

I have a different understanding here.
My understanding is that "held" means no INTA in HW, i.e. the LAPIC still
holds this IRQ.

> 
>>  We could have either:  1) A pure SW "queue" that will be flush to HW
>> register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW register.
>> 
> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
> have 
> queue of three elements sorted by priority. On each entry we should

Table 5-4 already says NMI/IRQ are BENIGN.

> inject an event with highest priority. And remove it from queue on
> exit. 

The problem is that we have to decide to inject only one of the above 3
and discard the rest. Whether to prioritize them or merge them (into one
event, as in Table 5-4) is another story.

> 
>> 
>> A potential benefit is that it can avoid duplicated code and
>> potential bugs 
>> in current code as following patch shows if I understand correctly:
>> 
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>                 cr2 = vmcs_readl(EXIT_QUALIFICATION);
>>                 KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
>>                             (u32)((u64)cr2 >> 32), handler);
>> -               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
>> +               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
>>                         kvm_mmu_unprotect_page_virt(vcpu, cr2);
>>                 return kvm_mmu_page_fault(vcpu, cr2, error_code);
>>         }
> This fix is already in Avi's tree (not yet pushed).
> 
>> Either way are OK and up to you. BTW Xen uses HW register directly
>> to representing 
>> an pending event.
>> 
> In this particular case I don't mind to use HW register either, but I
> don't see any advantage.
> 
>>> 
>>>> One example is that exception has higher priority
>>>> than NMI/IRQ injection in current code which is not true in
>>>> reality. 
>>>> 
>>> 
>>> I don't think it matters in practice, since the guest will see it
>>> as a timing issue.  NMIs and IRQs are asynchronous (even those
>>> generated by the guest through the local APIC).
>> 
>> Yes. But also cause IRQ injection be delayed which may have side
>> effect. 
>> For example if guest exception handler is very longer or if guest
>> VCPU fall into recursive #GP. Within current logic, a guest IRQ
>> event from KDB (IPI) running 
>> on VCPU0, as an example, can't force the dead loop VCPU1 into KDB
>> since it 
>> is recursively #GP.
> If one #GP causes another #GP this is a #DF. If CPU has a chance to

I mean another #GP on the next instruction, i.e. at the beginning of the
#GP handler in the guest. No #DF here.

> executes 
> something in between KVM will have a chance to inject NMI.

It could have no chance in some cases, though that is not very common.

> 
>> 
>>> 
>>>> Another issue is that an failed event from previous injection say
>>>> IRQ or NMI may be discarded if an virtual exception happens in the
>>>> EXIT handling now. With the patch of generic double fault handling,
>>>> this case should be handled as normally.
>>>> 
>>> 
>>> Discarding an exception is usually okay as it will be regenerated. 
>>> I don't think we discard interrupts or NMIs.
>> In reality (Running OS in guest), it doesn't happen so far. But
>> architecturally, 
>> it could. For example KVM injects an IRQ, but VM Resume get #PF and
>> back to KVM with IDT_VECTORING valid. Then KVM will put back the
>> failed 
>> IRQ to interrupt queue. But if #PF handling generates another
>> exception, 
>> then the interrupt queue won't be able to be injected, since KVM
>> inject 
>> exception first. And the interrupt queue is discarded at next VM
>> Exit. 
>> 
> I acknowledge the presence of the bug although I was not able to
> write a test case 
> to cause it yet, but it is easy to fix this without changing code too
> much. Unified event queue and clearing of only injected event on exit
> should do the trick. 

Yes, minor.

Eddie
Gleb Natapov May 14, 2009, 3:44 p.m. UTC | #3
On Thu, May 14, 2009 at 10:34:11PM +0800, Dong, Eddie wrote:
> Gleb Natapov wrote:
> > On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
> >> Avi Kivity wrote:
> >>> Dong, Eddie wrote:
> >>>> OK.
> >>>> Also back to Gleb's question, the reason I want to do that is to
> >>>> simplify event generation mechanism in current KVM.
> >>>> 
> >>>> Today KVM use additional layer of exception/nmi/interrupt such as
> >>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
> >>>> vcpu->arch.nmi_injected. All those additional layer is due to
> >>>> compete of VM_ENTRY_INTR_INFO_FIELD
> >>>> write to inject the event. Both SVM & VMX has only one resource to
> >>>> inject the virtual event but KVM generates 3 catagory of events in
> >>>> parallel which further requires additional
> >>>> logic to dictate among them.
> >>> 
> >>> I thought of using a queue to hold all pending events (in a common
> >>> format), sort it by priority, and inject the head.
> >> 
> >> The SDM Table 5-4 requires to merge 2 events together, i.e. convert
> >> to #DF/ 
> >> Triple fault or inject serially when 2 events happens no matter NMI,
> >> IRQ or exception. 
> >> 
> >> As if considering above events merging activity, that is a single
> >> element queue. 
> > I don't know how you got to this conclusion from you previous
> > statement. 
> > See explanation to table 5-2 for instate where it is stated that
> > interrupt should be held pending if there is exception with higher
> > priority. Should be held pending where? In the queue, like we do. Note
> > that low prio exceptions are just dropped since they will be
> > regenerated. 
> 
> I have different understanding here.
> My understanding is that "held" means NO INTA in HW, i.e. LAPIC still hold this IRQ.
> 
And what if INTA has already happened and the CPU is ready to fetch the
IDT entry for the interrupt vector, and at this very moment the CPU faults?

> > 
> >>  We could have either:  1) A pure SW "queue" that will be flush to HW
> >> register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW register.
> >> 
> > We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
> > have 
> > queue of three elements sorted by priority. On each entry we should
> 
> Table 5-4 alreadys says NMI/IRQ is BENIGN.
Table 5-2 applies here, not Table 5-4, I think.

> 
> > inject an event with highest priority. And remove it from queue on
> > exit. 
> 
> The problem is that we have to decide to inject only one of above 3, and discard the rest.
> Whether priority them or merge (to one event as Table 5-4) is another story.
Only a small number of events are merged into #DF. Most are handled
serially (the SDM does not define what "serially" means, unfortunately),
so I don't understand where "discard the rest" comes from. We can discard
an exception since it will be regenerated anyway, but IRQ and NMI are
another story. The SDM says that an IRQ should be held pending (once
again, not much explanation here), and says nothing about NMI.

> > 
> >> 
> >> A potential benefit is that it can avoid duplicated code and
> >> potential bugs 
> >> in current code as following patch shows if I understand correctly:
> >> 
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> >>                 cr2 = vmcs_readl(EXIT_QUALIFICATION);
> >>                 KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
> >>                             (u32)((u64)cr2 >> 32), handler);
> >> -               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
> >> +               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
> >>                         kvm_mmu_unprotect_page_virt(vcpu, cr2);
> >>                 return kvm_mmu_page_fault(vcpu, cr2, error_code);
> >>         }
> > This fix is already in Avi's tree (not yet pushed).
> > 
> >> Either way are OK and up to you. BTW Xen uses HW register directly
> >> to representing 
> >> an pending event.
> >> 
> > In this particular case I don't mind to use HW register either, but I
> > don't see any advantage.
> > 
> >>> 
> >>>> One example is that exception has higher priority
> >>>> than NMI/IRQ injection in current code which is not true in
> >>>> reality. 
> >>>> 
> >>> 
> >>> I don't think it matters in practice, since the guest will see it
> >>> as a timing issue.  NMIs and IRQs are asynchronous (even those
> >>> generated by the guest through the local APIC).
> >> 
> >> Yes. But also cause IRQ injection be delayed which may have side
> >> effect. 
> >> For example if guest exception handler is very longer or if guest
> >> VCPU fall into recursive #GP. Within current logic, a guest IRQ
> >> event from KDB (IPI) running 
> >> on VCPU0, as an example, can't force the dead loop VCPU1 into KDB
> >> since it 
> >> is recursively #GP.
> > If one #GP causes another #GP this is a #DF. If CPU has a chance to
> 
> Means another #GP in next instruction i.e. Beginning of #GP handler in guest.
> No #DF here.
> 
In this case we will enter the guest with an "NMI window open" request and
should exit immediately, before the first instruction of the #GP handler.
At this moment KVM will be able to inject the NMI.

> > executes 
> > something in between KVM will have a chance to inject NMI.
> 
> Could have no chance in some cases though not very common.
> 
> > 
> >> 
> >>> 
> >>>> Another issue is that an failed event from previous injection say
> >>>> IRQ or NMI may be discarded if an virtual exception happens in the
> >>>> EXIT handling now. With the patch of generic double fault handling,
> >>>> this case should be handled as normally.
> >>>> 
> >>> 
> >>> Discarding an exception is usually okay as it will be regenerated. 
> >>> I don't think we discard interrupts or NMIs.
> >> In reality (Running OS in guest), it doesn't happen so far. But
> >> architecturally, 
> >> it could. For example KVM injects an IRQ, but VM Resume get #PF and
> >> back to KVM with IDT_VECTORING valid. Then KVM will put back the
> >> failed 
> >> IRQ to interrupt queue. But if #PF handling generates another
> >> exception, 
> >> then the interrupt queue won't be able to be injected, since KVM
> >> inject 
> >> exception first. And the interrupt queue is discarded at next VM
> >> Exit. 
> >> 
> > I acknowledge the presence of the bug although I was not able to
> > write a test case 
> > to cause it yet, but it is easy to fix this without changing code too
> > much. Unified event queue and clearing of only injected event on exit
> > should do the trick. 
> 
> Yes, minor.
> 
> Eddie

--
			Gleb.
Dong, Eddie May 15, 2009, 7:57 a.m. UTC | #4
Gleb Natapov wrote:
> On Thu, May 14, 2009 at 10:34:11PM +0800, Dong, Eddie wrote:
>> Gleb Natapov wrote:
>>> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote:
>>>> Avi Kivity wrote:
>>>>> Dong, Eddie wrote:
>>>>>> OK.
>>>>>> Also back to Gleb's question, the reason I want to do that is to
>>>>>> simplify event generation mechanism in current KVM.
>>>>>> 
>>>>>> Today KVM use additional layer of exception/nmi/interrupt such as
>>>>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending &
>>>>>> vcpu->arch.nmi_injected. All those additional layer is due to
>>>>>> compete of VM_ENTRY_INTR_INFO_FIELD
>>>>>> write to inject the event. Both SVM & VMX has only one resource
>>>>>> to inject the virtual event but KVM generates 3 catagory of
>>>>>> events in parallel which further requires additional
>>>>>> logic to dictate among them.
>>>>> 
>>>>> I thought of using a queue to hold all pending events (in a common
>>>>> format), sort it by priority, and inject the head.
>>>> 
>>>> The SDM Table 5-4 requires to merge 2 events together, i.e.
>>>> convert to #DF/ Triple fault or inject serially when 2 events
>>>> happens no matter NMI, IRQ or exception. 
>>>> 
>>>> As if considering above events merging activity, that is a single
>>>> element queue.
>>> I don't know how you got to this conclusion from you previous
>>> statement. See explanation to table 5-2 for instate where it is
>>> stated that interrupt should be held pending if there is exception
>>> with higher priority. Should be held pending where? In the queue,
>>> like we do. Note that low prio exceptions are just dropped since
>>> they will be regenerated.
>> 
>> I have different understanding here.
>> My understanding is that "held" means NO INTA in HW, i.e. LAPIC
>> still hold this IRQ. 
>> 
> And what if INTA already happened and CPU is ready to fetch IDT for
> interrupt vector and at this very moment CPU faults?

If INTA happens, that means it is delivered. If its delivery triggers
another exception, that is what Table 5-4 handles.

My understanding is that it is a 2-stage process. Table 5-2 talks about
events happening before delivery, which HW needs to prioritize. Once a
decision is made, the highest-priority event is delivered, but its
delivery could then trigger another exception when fetching the IDT, etc.

The current exception.pending/interrupt.pending/nmi_injected doesn't match
either of the above; interrupt/nmi is only for failed event injection, and
there is a strange fixed priority check when an event is really injected:
exception > failed NMI > failed IRQ > new NMI > new IRQ.
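
In pseudo-C that order is roughly the following (a paraphrase for
discussion; the helpers are made up, only the vcpu->arch fields are real):

static void inject_pending_event_sketch(struct kvm_vcpu *vcpu)
{
	if (vcpu->arch.exception.pending)		/* 1. exception */
		queue_exception_to_hw(vcpu);
	else if (vcpu->arch.nmi_injected)		/* 2. re-inject failed NMI */
		reinject_nmi(vcpu);
	else if (vcpu->arch.interrupt.pending)		/* 3. re-inject failed IRQ */
		reinject_irq(vcpu);
	else if (nmi_should_inject(vcpu))		/* 4. new NMI */
		inject_new_nmi(vcpu);
	else if (irq_should_inject(vcpu))		/* 5. new IRQ */
		inject_new_irq(vcpu);
}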

Table 5-2 looks unimplemented in current KVM, IMO, apart from a wrong (but
minor) exception > NMI > IRQ ordering.

> 
>>> 
>>>>  We could have either:  1) A pure SW "queue" that will be flush to
>>>> HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW
>>>> register. 
>>>> 
>>> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
>>> have queue of three elements sorted by priority. On each entry we
>>> should 
>> 
>> Table 5-4 alreadys says NMI/IRQ is BENIGN.
> Table 5-2 applies here not table 5-4 I think.
> 
>> 
>>> inject an event with highest priority. And remove it from queue on
>>> exit.
>> 
>> The problem is that we have to decide to inject only one of above 3,
>> and discard the rest. Whether priority them or merge (to one event
>> as Table 5-4) is another story. 
> Only a small number of event are merged into #DF. Most handled
> serially (SDM does not define what serially means unfortunately), so
> I don't understand where "discard the rest" is come from. We can

vmx_complete_interrupts clears all of them at the next exit.

Even from the HW point of view, if there are a pending NMI, IRQ, and
exception, the CPU picks the highest one, the NMI, ignores/discards the
IRQ (but the LAPIC still holds the IRQ, so it can be re-injected), and
completely discards the exception.

I'm not saying discarding is a problem, but it is unnecessary to keep all 3.
The only difference is when to discard the other 2: at queue_exception/irq/nmi
time or later on (even at the next exit), which is the same to me.

> discard exception since it will be regenerated anyway, but IRQ and
> NMI is another story. SDM says that IRQ should be held pending (once
> again not much explanation here), nothing about NMI.
> 
>>> 
>>>> 
>>>> A potential benefit is that it can avoid duplicated code and
>>>> potential bugs in current code as following patch shows if I
>>>> understand correctly: 
>>>> 
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>>>                 cr2 = vmcs_readl(EXIT_QUALIFICATION);
>>>>                 KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
>>>>                             (u32)((u64)cr2 >> 32), handler);
>>>> -               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
>>>> +               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
>>>>                         kvm_mmu_unprotect_page_virt(vcpu, cr2);
>>>>                 return kvm_mmu_page_fault(vcpu, cr2, error_code);
>>>>         }
>>> This fix is already in Avi's tree (not yet pushed).
>>> 
>>>> Either way are OK and up to you. BTW Xen uses HW register directly
>>>> to representing an pending event.
>>>> 
>>> In this particular case I don't mind to use HW register either, but
>>> I don't see any advantage. 
>>> 
>>>>> 
>>>>>> One example is that exception has higher priority
>>>>>> than NMI/IRQ injection in current code which is not true in
>>>>>> reality. 
>>>>>> 
>>>>> 
>>>>> I don't think it matters in practice, since the guest will see it
>>>>> as a timing issue.  NMIs and IRQs are asynchronous (even those
>>>>> generated by the guest through the local APIC).
>>>> 
>>>> Yes. But also cause IRQ injection be delayed which may have side
>>>> effect. For example if guest exception handler is very longer or
>>>> if guest VCPU fall into recursive #GP. Within current logic, a
>>>> guest IRQ event from KDB (IPI) running on VCPU0, as an example,
>>>> can't force the dead loop VCPU1 into KDB since it is recursively
>>>> #GP. 
>>> If one #GP causes another #GP this is a #DF. If CPU has a chance to
>> 
>> Means another #GP in next instruction i.e. Beginning of #GP handler
>> in guest. 
>> No #DF here.
>> 
> In this case we will enter guest with "NMI windows open" request and
> should exit immediately before first instruction of #GP handler. At
> this moment KVM will be able to inject NMI.

If the HW NMI window is supported, it is fine, but what about the SW NMI
case? The flow will then look like:

Guest #GP instruction -> VM Exit -> Inject virtual #GP -> VMRESUME ->
try to execute 1st ins of guest #GP handler -> VM Exit again (#GP) -> 
inject virtual #GP -> .....

> 
>>> executes
>>> something in between KVM will have a chance to inject NMI.
>> 
>> Could have no chance in some cases though not very common.
>> 
>>> 
>>>> 
>>>>> 
>>>>>> Another issue is that an failed event from previous injection say
>>>>>> IRQ or NMI may be discarded if an virtual exception happens in
>>>>>> the EXIT handling now. With the patch of generic double fault
>>>>>> handling, this case should be handled as normally.
>>>>>> 
>>>>> 
>>>>> Discarding an exception is usually okay as it will be regenerated.
>>>>> I don't think we discard interrupts or NMIs.
>>>> In reality (Running OS in guest), it doesn't happen so far. But
>>>> architecturally, it could. For example KVM injects an IRQ, but VM
>>>> Resume get #PF and back to KVM with IDT_VECTORING valid. Then KVM
>>>> will put back the failed IRQ to interrupt queue. But if #PF
>>>> handling generates another exception, then the interrupt queue
>>>> won't be able to be injected, since KVM inject exception first.
>>>> And the interrupt queue is discarded at next VM Exit. 
>>>> 
>>> I acknowledge the presence of the bug although I was not able to
>>> write a test case to cause it yet, but it is easy to fix this
>>> without changing code too much. Unified event queue and clearing of
>>> only injected event on exit should do the trick.
>> 
>> Yes, minor.
>> 
>> Eddie

Gleb Natapov May 17, 2009, 9:44 a.m. UTC | #5
On Fri, May 15, 2009 at 03:57:44PM +0800, Dong, Eddie wrote:
> > And what if INTA already happened and CPU is ready to fetch IDT for
> > interrupt vector and at this very moment CPU faults?
> 
> If INTA happens, that means it is delivered. If its delivery triggers another 
> exception, that is what Table5-4 handles.
> 
Is this documented behaviour? Why not execute an instruction in parallel
with the INTA and IDT access?

> My understanding is that it is 2 stage process. Table 5-2 talk about 
> events happening before delivery, so that HW needs to prioritize them. 
> Once a decision is make, the highest one is delivered but then it could 
> trigger another exception when fetching IDT etc.
> 
> Current execption.pending/interrupt.pending/nmi_injected doesn't match 
> either of above, interrupt/nmi is only for failed event injection, and a strange
> fixed priority check when it is really injected: 
> exception > failed NMI > failed IRQ > new NMI > new IRQ.
The current code assumes that only one of failed NMI, failed IRQ, or
exception can be true, so there is no prioritization at all between them;
it just checks which one is available. If there is no event that should be
re-injected, then the exception is indeed checked first, but, as Avi said,
it doesn't matter. It depends on timing and can't affect any guest code,
and that is what is really important. This is not the only place where KVM
timing is not the same as CPU timing, BTW.
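
Roughly, exit-time completion works like this (a from-memory paraphrase,
not a verbatim quote of vmx.c, so the names may differ): the exit records
at most one in-flight event, so at most one of the three flags gets set.

static void complete_interrupts_sketch(struct kvm_vcpu *vcpu, u32 idt_vectoring_info)
{
	u8 vector;
	u32 type;

	if (!(idt_vectoring_info & VECTORING_INFO_VALID_MASK))
		return;	/* nothing was in flight; all three flags stay clear */

	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;

	if (type == INTR_TYPE_NMI_INTR)
		vcpu->arch.nmi_injected = true;		/* failed NMI */
	else if (type == INTR_TYPE_EXT_INTR)
		kvm_queue_interrupt(vcpu, vector);	/* failed IRQ */
	else
		kvm_queue_exception(vcpu, vector);	/* failed exception */
}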

The pending-exception field has two meanings, though. It is true either
for an exception that failed injection or for a newly generated exception,
but KVM is not a CPU emulator and the second case happens very rarely
(only during emulation). Don't forget that most exceptions are handled by
the CPU directly without KVM even knowing about them. I wrote a test case
that generates #NP during NMI handling. I expected that after calling the
#NP handler, which fixes the segment descriptor, the NMI handler would be
called, but this is not what happened on real HW. #NP was called and the
NMI was dropped, yet Table 5-2 states that they should be handled serially.
The good thing is that the way KVM handles this situation now is the same
as real HW does, but if we change the code to do what the SDM says, KVM
will behave differently. I should do the same test with an IRQ some day.

> 
> Table 5-2 looks missed in current KVM IMO except a wrong (but minor)
>  exception > NMI > IRQ sequence.
Doesn't matter. See above.

> 
> > 
> >>> 
> >>>>  We could have either:  1) A pure SW "queue" that will be flush to
> >>>> HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW
> >>>> register. 
> >>>> 
> >>> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should
> >>> have queue of three elements sorted by priority. On each entry we
> >>> should 
> >> 
> >> Table 5-4 alreadys says NMI/IRQ is BENIGN.
> > Table 5-2 applies here not table 5-4 I think.
> > 
> >> 
> >>> inject an event with highest priority. And remove it from queue on
> >>> exit.
> >> 
> >> The problem is that we have to decide to inject only one of above 3,
> >> and discard the rest. Whether priority them or merge (to one event
> >> as Table 5-4) is another story. 
> > Only a small number of event are merged into #DF. Most handled
> > serially (SDM does not define what serially means unfortunately), so
> > I don't understand where "discard the rest" is come from. We can
> 
> vmx_complete_interrupts clear all of them at next EXIT.
> 
Because it is assumed that only one of them is true at the same time.

> Even from HW point of view, if there are pending NMI/IRQ/exception,
> CPU pick highest one, NMI, ignore/discard IRQ (but LAPIC still holds 
> IRQ, thus it can be re-injected), completely discard exception.
> 
For NMI/IRQ that is what happens. If nmi_pending is true, the IRQ is held
pending in the LAPIC (we don't even check whether an IRQ is pending; who
cares). For exception ordering, see above.

> I don't say discarding has any problem, but unnecessary to keep all of 3.
> the only difference is when to discard the rest 2, at queue_exception/irq/nmi 
> time or later on (even at next EXIT time), which is same to me.
> 
> > discard exception since it will be regenerated anyway, but IRQ and
> > NMI is another story. SDM says that IRQ should be held pending (once
> > again not much explanation here), nothing about NMI.
> > 
> >>> 
> >>>> 
> >>>> A potential benefit is that it can avoid duplicated code and
> >>>> potential bugs in current code as following patch shows if I
> >>>> understand correctly: 
> >>>> 
> >>>> --- a/arch/x86/kvm/vmx.c
> >>>> +++ b/arch/x86/kvm/vmx.c
> >>>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> >>>>                 cr2 = vmcs_readl(EXIT_QUALIFICATION);
> >>>>                 KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
> >>>>                             (u32)((u64)cr2 >> 32), handler);
> >>>> -               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
> >>>> +               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
> >>>>                         kvm_mmu_unprotect_page_virt(vcpu, cr2);
> >>>>                 return kvm_mmu_page_fault(vcpu, cr2, error_code);
> >>>>         }
> >>> This fix is already in Avi's tree (not yet pushed).
> >>> 
> >>>> Either way are OK and up to you. BTW Xen uses HW register directly
> >>>> to representing an pending event.
> >>>> 
> >>> In this particular case I don't mind to use HW register either, but
> >>> I don't see any advantage. 
> >>> 
> >>>>> 
> >>>>>> One example is that exception has higher priority
> >>>>>> than NMI/IRQ injection in current code which is not true in
> >>>>>> reality. 
> >>>>>> 
> >>>>> 
> >>>>> I don't think it matters in practice, since the guest will see it
> >>>>> as a timing issue.  NMIs and IRQs are asynchronous (even those
> >>>>> generated by the guest through the local APIC).
> >>>> 
> >>>> Yes. But also cause IRQ injection be delayed which may have side
> >>>> effect. For example if guest exception handler is very longer or
> >>>> if guest VCPU fall into recursive #GP. Within current logic, a
> >>>> guest IRQ event from KDB (IPI) running on VCPU0, as an example,
> >>>> can't force the dead loop VCPU1 into KDB since it is recursively
> >>>> #GP. 
> >>> If one #GP causes another #GP this is a #DF. If CPU has a chance to
> >> 
> >> Means another #GP in next instruction i.e. Beginning of #GP handler
> >> in guest. 
> >> No #DF here.
> >> 
> > In this case we will enter guest with "NMI windows open" request and
> > should exit immediately before first instruction of #GP handler. At
> > this moment KVM will be able to inject NMI.
> 
> If the HW NMI windows is supported, it is fine, how about SW NMI case?
There is no SW NMI case. There are buggy CPUs that do not provide proper
NMI injection support. There is a hack to implement a kinda-working NMI on
such CPUs, but there are enough scenarios that will not work correctly
that you should not rely on this hack if your guest uses NMI for anything
serious.

> The flow will then look like:
> 
> Guest #GP instruction -> VM Exit -> Inject virtual #GP -> VMRESUME ->
> try to execute 1st ins of guest #GP handler -> VM Exit again (#GP) -> 
> inject virtual #GP -> .....
No. It will look like: guest #GP instruction -> guest #GP handler -> try
to execute 1st instruction of guest #GP handler -> guest #GP -> guest #GP
handler -> something sends NMI -> vcpu_kick() -> VM exit due to kick() ->
inject NMI on the next VM entry -> everything works even on CPUs with
"SW NMI". KVM is not an emulator. Avi, can you rename the project to KINE
please?


--
			Gleb.

Patch

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
                cr2 = vmcs_readl(EXIT_QUALIFICATION);
                KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
                            (u32)((u64)cr2 >> 32), handler);
-               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
+               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending || vcpu->arch.nmi_injected)
                        kvm_mmu_unprotect_page_virt(vcpu, cr2);
                return kvm_mmu_page_fault(vcpu, cr2, error_code);
        }


If we use the merged SW "queue" above, or read the HW register directly,
we can instead do the following:

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
                cr2 = vmcs_readl(EXIT_QUALIFICATION);
                KVMTRACE_3D(PAGE_FAULT, vcpu, error_code, (u32)cr2,
                            (u32)((u64)cr2 >> 32), handler);
-               if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending)
+               if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK)
                        kvm_mmu_unprotect_page_virt(vcpu, cr2);
                return kvm_mmu_page_fault(vcpu, cr2, error_code);
        }