[PART1,RFC,5/9] svm: Add VMEXIT handlers for AVIC

Message ID 1455285574-27892-6-git-send-email-suravee.suthikulpanit@amd.com (mailing list archive)
State New, archived

Commit Message

Suravee Suthikulpanit Feb. 12, 2016, 1:59 p.m. UTC
Introduce VMEXIT handlers, avic_incp_ipi_interception() and
avic_noaccel_interception().

Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
---
 arch/x86/include/uapi/asm/svm.h |   9 +-
 arch/x86/kvm/svm.c              | 241 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 249 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini Feb. 12, 2016, 3:38 p.m. UTC | #1
On 12/02/2016 14:59, Suravee Suthikulpanit wrote:
> +		 "icrh:icrl=%#010x:%08x, id=%u, index=%u\n",
> +		 __func__, svm->vcpu.cpu, svm->vcpu.vcpu_id,
> +		 icrh, icrl, id, index);
> +
> +	switch (id) {
> +	case AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE:
> +		/*
> +		 * AVIC hardware handles the generation of
> +		 * IPIs when the specified Message Type is Fixed
> +		 * (also known as fixed delivery mode) and
> +		 * the Trigger Mode is edge-triggered. The hardware
> +		 * also supports self and broadcast delivery modes
> +		 * specified via the Destination Shorthand(DSH)
> +		 * field of the ICRL. Logical and physical APIC ID
> +		 * formats are supported. All other IPI types cause
> +		 * a #VMEXIT, which needs to be emulated.
> +		 */
> +		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
> +		kvm_lapic_reg_write(apic, APIC_ICR, icrl);
> +		break;
> +	case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
> +		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
> +		kvm_lapic_reg_write(apic, APIC_ICR, icrl);

Wouldn't this cause a double injection of the IPI if the following happens:

1) destination 1 is running, so the processor sets IRR and sends a
doorbell message

2) destination 2 is not running, so the processor sets IRR and exits

3) destination 1 processes the interrupt, moving it from IRR to ISR

4) destination 1 sends an EOI

5) the source exits and reinjects the interrupt

6) destination 1 then receives the interrupt again.


Or alternatively:

1) destination 1 is not running, so the processor sets IRR and exits

2) another CPU executes VMRUN for destination 1, so the processor
injects the interrupt

3) destination 1 sends an EOI

4) the source exits and reinjects the interrupt

5) destination 1 then receives the interrupt again.


The handling of races for IsRunning and incomplete IPIs has always been
very confusing to me whenever I read the AVIC specification.  It would
be great if you could clarify this.

Paolo

> +		break;
> +	case AVIC_INCMP_IPI_ERR_INV_TARGET:
> +		pr_err("SVM: %s: Invalid IPI target (icr=%#08x:%08x, idx=%u)\n",
> +		       __func__, icrh, icrl, index);
> +		BUG();
> +		break;
> +	case AVIC_INCMP_IPI_ERR_INV_BK_PAGE:
> +		pr_err("SVM: %s: Invalid bk page (icr=%#08x:%08x, idx=%u)\n",
> +		       __func__, icrh, icrl, index);
> +		BUG();
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Feb. 15, 2016, 7:22 p.m. UTC | #2
2016-02-12 16:38+0100, Paolo Bonzini:
> On 12/02/2016 14:59, Suravee Suthikulpanit wrote:
>> +	case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
>> +		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>> +		kvm_lapic_reg_write(apic, APIC_ICR, icrl);
> 
> Wouldn't this cause a double injection of the IPI if the following happens:

I think it will.  (IRR was written to the APIC pages, so the hypervisor's
only job is to make sure that all targeted VCPUs eventually run.)

> The handling of races for IsRunning and incomplete IPIs has always been
> very confusing to me whenever I read the AVIC specification.  It would
> be great if you could clarify this.

Yeah, we have a bug there as well:  If all target VCPUs have IsRunning set
and are in the process of being scheduled out (avic_set_running false),
then there is no VMEXIT on the IPI and the doorbell does nothing[1];  KVM
doesn't re-check pending interrupts before actually scheduling out,
therefore the VCPUs will wake only if another interrupt comes.

The hypervisor can manage the IsRunning bit as it wishes, so KVM probably
should set IsRunning to false before scanning IRR.


---
1: I didn't find a single mention of a situation where the doorbell
   arrives outside of guest mode, so I presume that nothing happens.
   Is that right?

   Thanks.
Suravee Suthikulpanit Feb. 16, 2016, 6:29 a.m. UTC | #3
Hi Paolo,

On 2/12/16 22:38, Paolo Bonzini wrote:
>
>
> On 12/02/2016 14:59, Suravee Suthikulpanit wrote:
>> +		 "icrh:icrl=%#010x:%08x, id=%u, index=%u\n",
>> +		 __func__, svm->vcpu.cpu, svm->vcpu.vcpu_id,
>> +		 icrh, icrl, id, index);
>> +
>> +	switch (id) {
>> +	case AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE:
>> +		/*
>> +		 * AVIC hardware handles the generation of
>> +		 * IPIs when the specified Message Type is Fixed
>> +		 * (also known as fixed delivery mode) and
>> +		 * the Trigger Mode is edge-triggered. The hardware
>> +		 * also supports self and broadcast delivery modes
>> +		 * specified via the Destination Shorthand(DSH)
>> +		 * field of the ICRL. Logical and physical APIC ID
>> +		 * formats are supported. All other IPI types cause
>> +		 * a #VMEXIT, which needs to be emulated.
>> +		 */
>> +		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>> +		kvm_lapic_reg_write(apic, APIC_ICR, icrl);
>> +		break;
>> +	case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
>> +		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>> +		kvm_lapic_reg_write(apic, APIC_ICR, icrl);
>
> Wouldn't this cause a double injection of the IPI if the following happens:
>
> 1) destination 1 is running, so the processor sets IRR and sends a
> doorbell message
>
> 2) destination 2 is not running, so the processor sets IRR and exits
>
> 3) destination 1 processes the interrupt, moving it from IRR to ISR
>
> 4) destination 1 sends an EOI
>
> 5) the source exits and reinjects the interrupt
>
> 6) destination 1 then receives the interrupt again.

Not sure if I am following your scenario here.  IIUC, your concern is 
regarding dest2, which was not running at the time the IPI 
message was sent to both dest1 and dest2?

In this case, since the HW cannot deliver because one or more target 
vcpus are not running, I believe it would not set the IRR bit of 
dest1, and would generate the AVIC_INCOMPLETE_IPI #VMEXIT above instead. 
I don't think it would progress to step 3 right away.

>
>
> Or alternatively:
>
> 1) destination 1 is not running, so the processor sets IRR and exits
>
> 2) another CPU executes VMRUN for destination 1, so the processor
> injects the interrupt
>
> 3) destination 1 sends an EOI
>
> 4) the source exits and reinjects the interrupt
>
> 5) destination 1 then receives the interrupt again.

Same here, I don't think the HW would set the IRR of dest1. Instead, 
it would generate the #VMEXIT.

> The handling of races for IsRunning and incomplete IPIs has always been
> very confusing to me whenever I read the AVIC specification.  It would
> be great if you could clarify this.

I'll confirm with the HW designer again just to be sure.

Thanks,
Suravee

> Paolo
>
>> +		break;
>> +	case AVIC_INCMP_IPI_ERR_INV_TARGET:
>> +		pr_err("SVM: %s: Invalid IPI target (icr=%#08x:%08x, idx=%u)\n",
>> +		       __func__, icrh, icrl, index);
>> +		BUG();
>> +		break;
>> +	case AVIC_INCMP_IPI_ERR_INV_BK_PAGE:
>> +		pr_err("SVM: %s: Invalid bk page (icr=%#08x:%08x, idx=%u)\n",
>> +		       __func__, icrh, icrl, index);
>> +		BUG();
Paolo Bonzini Feb. 16, 2016, 12:15 p.m. UTC | #4
On 16/02/2016 07:29, Suravee Suthikulpanit wrote:
> Hi Paolo,
> 
> On 2/12/16 22:38, Paolo Bonzini wrote:
>>
>>
>> On 12/02/2016 14:59, Suravee Suthikulpanit wrote:
>>> +         "icrh:icrl=%#010x:%08x, id=%u, index=%u\n",
>>> +         __func__, svm->vcpu.cpu, svm->vcpu.vcpu_id,
>>> +         icrh, icrl, id, index);
>>> +
>>> +    switch (id) {
>>> +    case AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE:
>>> +        /*
>>> +         * AVIC hardware handles the generation of
>>> +         * IPIs when the specified Message Type is Fixed
>>> +         * (also known as fixed delivery mode) and
>>> +         * the Trigger Mode is edge-triggered. The hardware
>>> +         * also supports self and broadcast delivery modes
>>> +         * specified via the Destination Shorthand(DSH)
>>> +         * field of the ICRL. Logical and physical APIC ID
>>> +         * formats are supported. All other IPI types cause
>>> +         * a #VMEXIT, which needs to be emulated.
>>> +         */
>>> +        kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>>> +        kvm_lapic_reg_write(apic, APIC_ICR, icrl);
>>> +        break;
>>> +    case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
>>> +        kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>>> +        kvm_lapic_reg_write(apic, APIC_ICR, icrl);
>>
>> Wouldn't this cause a double injection of the IPI if the following
>> happens:
>>
>> 1) destination 1 is running, so the processor sets IRR and sends a
>> doorbell message
>>
>> 2) destination 2 is not running, so the processor sets IRR and exits
>>
>> 3) destination 1 processes the interrupt, moving it from IRR to ISR
>>
>> 4) destination 1 sends an EOI
>>
>> 5) the source exits and reinjects the interrupt
>>
>> 6) destination 1 then receives the interrupt again.
> 
> Not sure if I am following your scenario here.  IIUC, your concern is
> regarding the dest2 that was not running at the time that the IPI
> message is sent to both dest1 and dest2?
> 
> In this case, since the HW cannot deliver because one or more target
> vcpus are not running, I believe it would not set the IRR bit of
> dest1, and would generate the AVIC_INCOMPLETE_IPI #vmexit above instead. I
> don't think it would progress to step 3 right away.

The documentation doesn't say that setting the IRR bit is atomic across
all CPUs (and from a hardware perspective that would be extremely
unlikely).  Another hint in my opinion is that the vmexit is called
"incomplete" IPI, not something like "aborted" IPI.  "abort" might
suggest atomicity, "incomplete" definitely suggests *non*atomicity to me.

Wei, what do you think/recall?

I am afraid that this could be a showstopper for AVIC support in KVM.
The only solution I see is to have a different page for each CPU, so
that only self-IPIs are virtualized.  Then you'd only support
virtualization of self-IPIs, similar to Intel's APICv.

>> The handling of races for IsRunning and incomplete IPIs has always been
>> very confusing to me whenever I read the AVIC specification.  It would
>> be great if you could clarify this.
> 
> I'll make sure to confirm with the HW designer again just to be sure.

Please do, thanks!

Paolo
Radim Krčmář Feb. 16, 2016, 2:13 p.m. UTC | #5
2016-02-16 13:15+0100, Paolo Bonzini:
> On 16/02/2016 07:29, Suravee Suthikulpanit wrote:
>> On 2/12/16 22:38, Paolo Bonzini wrote:
>>> On 12/02/2016 14:59, Suravee Suthikulpanit wrote:
>>>> +    case AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE:
| [...]
>>>> +    case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
>>>> +        kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>>>> +        kvm_lapic_reg_write(apic, APIC_ICR, icrl);
>>>
>>> Wouldn't this cause a double injection of the IPI if the following
>>> happens:
>>>
>>> 1) destination 1 is running, so the processor sets IRR and sends a
>>> doorbell message
>>>
>>> 2) destination 2 is not running, so the processor sets IRR and exits
>>>
>>> 3) destination 1 processes the interrupt, moving it from IRR to ISR
>>>
>>> 4) destination 1 sends an EOI
>>>
>>> 5) the source exits and reinjects the interrupt
>>>
>>> 6) destination 1 then receives the interrupt again.
>> 
>> Not sure if I am following your scenario here.  IIUC, your concern is
>> regarding the dest2 that was not running at the time that the IPI
>> message is sent to both dest1 and dest2?
>> 
>> In this case, since the HW cannot deliver because one or more target
>> vcpus are not running, I believe it would not set the IRR bit of
>> dest1, and would generate the AVIC_INCOMPLETE_IPI #vmexit above instead. I
>> don't think it would progress to step 3 right away.

(AVIC handles logical interrupts and broadcasts, so an IPI to two
 destinations won't exit with AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE.)

I assume that the whole step must be completed before a subsequent step
is started, so either all IRRs are written or none is:

 3,4: If any IPI destination is not valid, then AVIC exits with
      !AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN before any IRR is written.
      (Broadcast ignores invalid, which is ok.)
 5: AVIC will set IRR and maybe send doorbell on all valid destinations.
 6: If doorbell wasn't sent to all valid destinations, exit with
    AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN

(Relevant bit of the spec is at the bottom.)

> The documentation doesn't say that setting the IRR bit is atomic across
> all CPUs (and from a hardware perspective that would be extremely
> unlikely).

Yeah, I think atomic there means that it won't race with other writes to
the same byte in IRR.  We're fine as long as AVIC writes IRR before
checking IsRunning on every destination, which it seems to be.

>             Another hint in my opinion is that the vmexit is called
> "incomplete" IPI, not something like "aborted" IPI.  "abort" might
> suggest atomicity, "incomplete" definitely suggests *non*atomicity to me.
> 
> Wei, what do you think/recall?
> 
> I am afraid that this could be a showstopper for AVIC support in KVM.

(It would, but I believe that AVIC designers made it sane and the spec
 doesn't let me read it in a way that supports your theories.)

> The only solution I see is to have a different page for each CPU, so
> that only self-IPIs are virtualized.  Then you'd only support
> virtualization of self-IPIs, similar to Intel's APICv.

We need one APIC page per VCPU and a pair (physical, logical) of APIC ID
tables per VM, but per CPU structures shouldn't be necessary.

>> I'll make sure to confirm with the HW designer again just to be sure.
> 
> Please do, thanks!

Thanks, looking forward to it!

We definitely have incompatible understanding of some aspects. :)


---
APM 2, June 2015, 15.29.2.4 Interrupt Delivery:

  Interprocessor Interrupts. To process an IPI, AVIC hardware executes
  the following steps:
  [...]
  3. If the destination(s) is (are) logically addressed, lookup the
     guest physical APIC IDs for each logical ID using the Logical APIC
     ID table.  If the entry is not valid (V bit is cleared), cause a
     #VMEXIT.  If the entry is valid, but contains an invalid backing
     page pointer, cause a #VMEXIT.

  4. Lookup the vAPIC backing page address in the Physical APIC table using
     the guest physical APIC ID as an index into the table.  For
     directed interrupts, if the selected table entry is not valid,
     cause a #VMEXIT. For broadcast IPIs, invalid entries are ignored.

  5. For every valid destination:
     - Atomically set the appropriate IRR bit in each of the
       destinations’ vAPIC backing page.
     - Check the IsRunning status of each destination.
     - If the destination IsRunning bit is set, send a doorbell message
       using the host physical core number from the Physical APIC ID
       table.

  6. If any destinations are identified as not currently scheduled on a
     physical core (i.e., the IsRunning bit for that virtual processor
     is not set), cause a #VMEXIT.
Paolo Bonzini Feb. 16, 2016, 4:56 p.m. UTC | #6
On 16/02/2016 15:13, Radim Krčmář wrote:
> Yeah, I think atomic there means that it won't race with other writes to
> the same byte in IRR.  We're fine as long as AVIC writes IRR before
> checking IsRunning on every destination, which it seems to be.

More precisely, if AVIC writes all IRRs (5.1) and ANDs all IsRunning
flags before checking the result of the AND (6).

> (It would, but I believe that AVIC designers made it sane and the spec
>  doesn't let me read it in a way that supports your theories.)

I hope so as well, and you've probably convinced me.  But I still think
the code is wrong in this patch.  Let's look at the spec that you pasted:

>   3. If the destination(s) is (are) logically addressed, lookup the
>      guest physical APIC IDs for each logical ID using the Logical APIC
>      ID table.  If the entry is not valid (V bit is cleared), cause a
>      #VMEXIT.  If the entry is valid, but contains an invalid backing
>      page pointer, cause a #VMEXIT.
> 
>   4. Lookup the vAPIC backing page address in the Physical APIC table using
>      the guest physical APIC ID as an index into the table.  For
>      directed interrupts, if the selected table entry is not valid,
>      cause a #VMEXIT. For broadcast IPIs, invalid entries are ignored.
> 
>   5. For every valid destination:
>      - Atomically set the appropriate IRR bit in each of the
>        destinations’ vAPIC backing page.
>      - Check the IsRunning status of each destination.
>      - If the destination IsRunning bit is set, send a doorbell message
>        using the host physical core number from the Physical APIC ID
>        table.

This is where the following steps happen:

1) destination 1 is running, so the processor sets IRR and sends a
doorbell message

2) destination 2 is a valid destination, so the processor sets IRR


In the meanwhile destination 1 is running on another VCPU so we can say
that it does the following:

3) destination 1 processes the interrupt, moving it from IRR to ISR

4) destination 1 sends an EOI


>   6. If any destinations are identified as not currently scheduled on a
>      physical core (i.e., the IsRunning bit for that virtual processor
>      is not set), cause a #VMEXIT.

Now the following happens:

5) the source exits and reinjects the interrupt (in Suravee's code, the
VMEXIT handler just writes again to ICR);

6) the KVM code has no way to know that destination 1 has serviced the
interrupt already, so destination 1 then receives the interrupt again.

So perhaps it's enough to change KVM to _not_ modify IRR on an
"incomplete IPI - target not running" vmexit, and instead only do

       kvm_make_request(KVM_REQ_EVENT, vcpu);
       kvm_vcpu_kick(vcpu);

on the destination VCPUs.  That would indeed simply be something to fix
in the patches.  Do you agree that this is a bug?

I'm curious about how often the AVIC VMEXIT fires.  Suravee, can you add
debugfs counters for the various incomplete IPI subcauses?


And since we are at it, I'm curious about the following two steps at the
end of 15.29.2.6.

- on VMRUN the interrupt state is evaluated and the highest priority
pending interrupt indicated in the IRR is delivered if interrupt masking
and priority allow

- Any doorbell signals received during VMRUN processing are recognized
immediately after entering the guest

Isn't step 1 exactly the same as evaluating the doorbell signals?  Is
the IRR evaluated only if the hypervisor had rang the doorbell, or
unconditionally?

Thanks,

Paolo
Radim Krčmář Feb. 16, 2016, 6:06 p.m. UTC | #7
2016-02-16 17:56+0100, Paolo Bonzini:
On 16/02/2016 15:13, Radim Krčmář wrote:
>> Yeah, I think atomic there means that it won't race with other writes to
>> the same byte in IRR.  We're fine as long as AVIC writes IRR before
>> checking IsRunning on every destination, which it seems to be.
> 
> More precisely, if AVIC writes all IRRs (5.1) and ANDs all IsRunning
> flags before checking the result of the AND (6).
> 
>> (It would, but I believe that AVIC designers made it sane and the spec
>>  doesn't let me read it in a way that supports your theories.)
> 
> I hope so as well, and you've probably convinced me.  But I still think
> the code is wrong in this patch.  Let's look at the spec that you pasted:

The code definitely is wrong.  I'll be more specific when disagreeing,
sorry.

> This is where the following steps happen:

  [I completely agree with the race presented here.]

> So perhaps it's enough to change KVM to _not_ modify IRR on an
> "incomplete IPI - target not running" vmexit, and instead only do
> 
>        kvm_make_request(KVM_REQ_EVENT, vcpu);
>        kvm_vcpu_kick(vcpu);
> 
> on the destination VCPUs.  That would indeed be simply just be something
> to fix in the patches.  Do you agree that this is a bug?

Yes.  (We don't even need KVM_REQ_EVENT, because there should be nothing
to do, KVM just has to run the guest.)

> I'm curious about how often the AVIC VMEXIT fires.

From a theoretical standpoint:

AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE:  Not much; an OS usually doesn't send
lowest-priority IPIs (not even supported on Intel), NMIs, INITs, ...
and the rest seems to be handled.

AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN: depends a lot on host load (and what
the guest does); most IPIs will trigger this on an over-committed host.

AVIC_INCMP_IPI_ERR_INV_TARGET: Almost never; only on guest OS bugs,
since the guest can trigger it by targeting non-existing VCPUs.
(Btw. calling BUG() there is a bug.)

AVIC_INCMP_IPI_ERR_INV_BK_PAGE: It's a bug in KVM, so hopefully never.

>                                                     Suravee, can you add
> debugfs counters for the various incomplete IPI subcauses?

Good point, large value in any of those would point to a problem.

> And since we are at it, I'm curious about the following two steps at the
> end of 15.29.2.6.
> 
> - on VMRUN the interrupt state is evaluated and the highest priority
> pending interrupt indicated in the IRR is delivered if interrupt masking
> and priority allow
> 
> - Any doorbell signals received during VMRUN processing are recognized
> immediately after entering the guest
> 
> Isn't step 1 exactly the same as evaluating the doorbell signals?

It is.

>                                                                    Is
> the IRR evaluated only if the hypervisor had rang the doorbell, or
> unconditionally?

Unconditionally.
(Supporting evidence: current code doesn't send doorbell when the VCPU
 is in host mode and I suppose that it works fine. :])

I think these two clauses cover a race on VMRUN:
when processing VMRUN, we might not consider the CPU to be in guest
mode, so these two disambiguate the case where VMRUN has already checked
IRR (it was empty) and another CPU set IRR and issued a doorbell before
VMRUN entered the guest.  (The doorbell could otherwise be considered
lost, because doorbells in host mode do nothing.)
Suravee Suthikulpanit Feb. 18, 2016, 2:25 a.m. UTC | #8
Hi

On 2/17/16 01:06, Radim Krčmář wrote:
> 2016-02-16 17:56+0100, Paolo Bonzini:
>> >On 16/02/2016 15:13, Radim Krčmář wrote:
>>> >>Yeah, I think atomic there means that it won't race with other writes to
>>> >>the same byte in IRR.  We're fine as long as AVIC writes IRR before
>>> >>checking IsRunning on every destination, which it seems to be.
>> >
>> >More precisely, if AVIC writes all IRRs (5.1) and ANDs all IsRunning
>> >flags before checking the result of the AND (6).
>> >
>>> >>(It would, but I believe that AVIC designers made it sane and the spec
>>> >>  doesn't let me read it in a way that supports your theories.)
>> >
>> >I hope so as well, and you've probably convinced me.  But I still think
>> >the code is wrong in this patch.  Let's look at the spec that you pasted:
> The code definitely is wrong.  I'll be more specific when disagreeing,
> sorry.
>

Could you please be a bit more specific about what you think I am not doing 
correctly to handle the #VMEXIT in the target-not-running case below?

+    case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
+        kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
+        kvm_lapic_reg_write(apic, APIC_ICR, icrl);

This is actually not just writing to the register. Please note that 
writing to the APIC_ICR register also ends up calling apic_send_ipi(), 
which results in injecting the interrupt into the target core:

From: arch/x86/kvm/lapic.c: kvm_lapic_reg_write()
[....]
         case APIC_ICR:
                 /* No delay here, so we always clear the pending bit */
                 apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
                 apic_send_ipi(apic);
                 break;

         case APIC_ICR2:
                 if (!apic_x2apic_mode(apic))
                         val &= 0xff000000;
                 apic_set_reg(apic, APIC_ICR2, val);
                 break;

Am I missing something?

Thanks,
Suravee
Radim Krčmář Feb. 18, 2016, 2:18 p.m. UTC | #9
2016-02-18 09:25+0700, Suravee Suthikulpanit:
> On 2/17/16 01:06, Radim Krčmář wrote:
>>2016-02-16 17:56+0100, Paolo Bonzini:
>>>>On 16/02/2016 15:13, Radim Krčmář wrote:
>>>>>>Yeah, I think atomic there means that it won't race with other writes to
>>>>>>the same byte in IRR.  We're fine as long as AVIC writes IRR before
>>>>>>checking IsRunning on every destination, which it seems to be.
>>>>
>>>>More precisely, if AVIC writes all IRRs (5.1) and ANDs all IsRunning
>>>>flags before checking the result of the AND (6).
>>>>
>>>>>>(It would, but I believe that AVIC designers made it sane and the spec
>>>>>>  doesn't let me read it in a way that supports your theories.)
>>>>
>>>>I hope so as well, and you've probably convinced me.  But I still think
>>>>the code is wrong in this patch.  Let's look at the spec that you pasted:
>>The code definitely is wrong.  I'll be more specific when disagreeing,
>>sorry.
>>
> 
> Would you please be a bit more specific on what you think I am not doing
> correctly to handle the #VMEXIT in the case of target not running below.
> 
> +    case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
> +        kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
> +        kvm_lapic_reg_write(apic, APIC_ICR, icrl);
> 
> This is actually not just writing to the register. Please note that writing
> to the APIC_ICR register also ends up calling apic_send_ipi(), which results
> in injecting the interrupt into the target core:

Exactly.  Injecting the interrupt in AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN
handler is causing the double-injection bug that Paolo described.

> Am I missing something?

Probably that AVIC already wrote to all IRRs (and sent appropriate
doorbells) before this VMEXIT, so KVM shouldn't repeat it.

KVM just has to make sure that targeted VCPUs notice the interrupt,
which means to kick (wake up) VCPUs that don't have IsRunning set.
There is no need to do anything with running VCPUs, because they
 - are in guest mode and noticed the doorbell
 - are in host mode, where they will
   1) VMRUN as fast as they can because the VCPU didn't want to halt
      (and IRR is handled on VMRUN)
   2) check IRR after unsetting IsRunning and goto (1) if there are
      pending interrupts.  (RFC doesn't do this, which is another bug)

It's still possible that we misunderstood the spec.  Does AVIC handle
IPIs differently?

Thanks.
Paolo Bonzini Feb. 18, 2016, 2:51 p.m. UTC | #10
On 18/02/2016 15:18, Radim Krčmář wrote:
> KVM just has to make sure that targeted VCPUs notice the interrupt,
> which means to kick (wake up) VCPUs that don't have IsRunning set.
> There is no need to do anything with running VCPUs, because they
>  - are in guest mode and noticed the doorbell
>  - are in host mode, where they will
>    1) VMRUN as fast as they can because the VCPU didn't want to halt
>       (and IRR is handled on VMRUN)
>    2) check IRR after unsetting IsRunning and goto (1) if there are
>       pending interrupts.  (RFC doesn't do this, which is another bug)

This is not necessary.  IsRunning is only cleared at vcpu_put time.  The
next KVM_RUN will look at IRR (kvm_arch_vcpu_runnable), if necessary set
the mp_state to KVM_MP_STATE_RUNNABLE, and do the VMRUN.

But I agree that this is what Suravee is missing.

> It's still possible that we misunderstood the spec.  Does AVIC handle
> IPIs differently?

I don't think we misunderstood it.  Well, I did, but that's fixed now. :)

Paolo
Radim Krčmář Feb. 18, 2016, 3:43 p.m. UTC | #11
2016-02-18 15:51+0100, Paolo Bonzini:
> On 18/02/2016 15:18, Radim Krčmář wrote:
>> KVM just has to make sure that targeted VCPUs notice the interrupt,
>> which means to kick (wake up) VCPUs that don't have IsRunning set.
>> There is no need to do anything with running VCPUs, because they
>>  - are in guest mode and noticed the doorbell
>>  - are in host mode, where they will
>>    1) VMRUN as fast as they can because the VCPU didn't want to halt
>>       (and IRR is handled on VMRUN)
>>    2) check IRR after unsetting IsRunning and goto (1) if there are
>>       pending interrupts.  (RFC doesn't do this, which is another bug)
> 
> This is not necessary.  IsRunning is only cleared at vcpu_put time.

It's not necessary if we are being preempted, but it is necessary to
clear IsRunning earlier when we are going to block (i.e. after a halt).

>                                                                      The
> next KVM_RUN will look at IRR (kvm_arch_vcpu_runnable), if necessary set
> the mp_state to KVM_MP_STATE_RUNNABLE, and do the VMRUN.

We're not always going to exit to userspace.  I think the following
order of actions could result in a stuck VM:

(The main idea is that VCPU targets another VCPU between its last check
 for IRR and clearing of IsRunning.)

1) vcpu0 has set IsRunning and is in guest mode.
2) vcpu0 executes HLT and exits guest mode.
3) vcpu0 doesn't have any pending interrupts or other stuff that would
   make kvm_arch_vcpu_runnable() return true.
4) vcpu0 runs kvm_vcpu_block() and gets to call schedule().
5) *vcpu1* sends IPI to vcpu0.  IsRunning is set on vcpu0, so AVIC
   doesn't exit.  A doorbell is sent, but it does nothing.
6) vcpu0 runs schedule() and clears IsRunning in a callback.

Paolo Bonzini Feb. 18, 2016, 3:53 p.m. UTC | #12
On 18/02/2016 16:43, Radim Krčmář wrote:
> 2016-02-18 15:51+0100, Paolo Bonzini:
>> On 18/02/2016 15:18, Radim Krčmář wrote:
>>> KVM just has to make sure that targeted VCPUs notice the interrupt,
>>> which means to kick (wake up) VCPUs that don't have IsRunning set.
>>> There is no need to do anything with running VCPUs, because they
>>>  - are in guest mode and noticed the doorbell
>>>  - are in host mode, where they will
>>>    1) VMRUN as fast as they can because the VCPU didn't want to halt
>>>       (and IRR is handled on VMRUN)
>>>    2) check IRR after unsetting IsRunning and goto (1) if there are
>>>       pending interrupts.  (RFC doesn't do this, which is another bug)
>>
>> This is not necessary.  IsRunning is only cleared at vcpu_put time.
> 
> It's not necessary if we are being preempted, but it is necessary to
> clear IsRunning earlier when we are going to block (i.e. after a halt).
> 
>>                                                                      The
>> next KVM_RUN will look at IRR (kvm_arch_vcpu_runnable), if necessary set
>> the mp_state to KVM_MP_STATE_RUNNABLE, and do the VMRUN.
> 
> We're not always going to exit to userspace.  I think the following
> order of actions could result in a stuck VM:
> 
> (The main idea is that VCPU targets another VCPU between its last check
>  for IRR and clearing of IsRunning.)
> 
> 1) vcpu0 has set IsRunning and is in guest mode.
> 2) vcpu0 executes HLT and exits guest mode.
> 3) vcpu0 doesn't have any pending interrupts or other stuff that would
>    make kvm_arch_vcpu_runnable() return true.
> 4) vcpu0 runs kvm_vcpu_block() and gets to call schedule().
> 5) *vcpu1* sends IPI to vcpu0.  IsRunning is set on vcpu0, so AVIC
>    doesn't exit.  A doorbell is sent, but it does nothing.
> 6) vcpu0 runs schedule() and clears IsRunning in a callback.

You're completely right.  When the VCPU is halting, the preempt
notifier's sched_out callback is too late to clear IsRunning; you need
to do that before the last time you check for IRR.

Patch 9 is okay, but it is also necessary to clear IsRunning in
kvm_arch_vcpu_blocking and set it in kvm_arch_vcpu_unblocking.  In
addition, vcpu_put/vcpu_load should not modify IsRunning between
kvm_arch_vcpu_blocking and kvm_arch_vcpu_unblocking.  Do you agree?
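
The placement being proposed can be sketched roughly as follows (a toy model with stand-in types; `avic_set_running()` and the `blocked` flag are assumed helpers, not the real kernel API):

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu {
	bool is_running;  /* IsRunning bit in the physical APIC ID table */
	bool blocked;     /* between vcpu_blocking and vcpu_unblocking */
};

static void avic_set_running(struct vcpu *v, bool run)
{
	v->is_running = run;
}

static void kvm_arch_vcpu_blocking(struct vcpu *v)
{
	v->blocked = true;
	avic_set_running(v, false);  /* IPIs now #VMEXIT so KVM can wake us */
}

static void kvm_arch_vcpu_unblocking(struct vcpu *v)
{
	avic_set_running(v, true);
	v->blocked = false;
}

/* vcpu_load/vcpu_put must not touch IsRunning while blocked. */
static void vcpu_load(struct vcpu *v)
{
	if (!v->blocked)
		avic_set_running(v, true);
}

static void vcpu_put(struct vcpu *v)
{
	if (!v->blocked)
		avic_set_running(v, false);
}
```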

Paolo
Radim Krčmář Feb. 18, 2016, 4:27 p.m. UTC | #13
2016-02-18 16:53+0100, Paolo Bonzini:
> Patch 9 is okay, but it is also necessary to clear IsRunning in
> kvm_arch_vcpu_blocking and set it in kvm_arch_vcpu_unblocking.  In
> addition, vcpu_put/vcpu_load should not modify IsRunning between
> kvm_arch_vcpu_blocking and kvm_arch_vcpu_unblocking.  Do you agree?

Yes.

I think we don't need to clear IsRunning on preemption, which would
simplify the protection.  (I haven't thought much about userspace exit,
so maybe we could skip that one as well, but we don't need to now.)

The reason for that is that KVM knows that the VCPU was scheduled out,
so it couldn't do much in the AVIC VMEXIT.
(KVM could force the scheduler to prioritize the VCPU, but our kick
 doesn't do that now and it seems like a bad idea.)

Does it seem reasonable?
Paolo Bonzini Feb. 18, 2016, 5:18 p.m. UTC | #14
On 18/02/2016 17:27, Radim Krčmář wrote:
> 2016-02-18 16:53+0100, Paolo Bonzini:
>> Patch 9 is okay, but it is also necessary to clear IsRunning in
>> kvm_arch_vcpu_blocking and set it in kvm_arch_vcpu_unblocking.  In
>> addition, vcpu_put/vcpu_load should not modify IsRunning between
>> kvm_arch_vcpu_blocking and kvm_arch_vcpu_unblocking.  Do you agree?
> 
> Yes.
> 
> I think we don't need to clear IsRunning on preemption, which would
> simplify the protection.  (I haven't thought much about userspace exit,
> so maybe we could skip that one as well, but we don't need to now.)
> 
> The reason for that is that KVM knows that the VCPU was scheduled out,
> so it couldn't do much in the AVIC VMEXIT.
> (KVM could force the scheduler to prioritize the VCPU, but our kick
>  doesn't do that now and it seems like a bad idea.)
> 
> Does it seem reasonable?

Yes, and in fact it wouldn't need to clear and set IsRunning on
vcpu_put/vcpu_load; only on vcpu_blocking/vcpu_unblocking.

The IsRunning flag is more of an IsNotHalted flag, in the end.

Paolo
Suravee Suthikulpanit Feb. 19, 2016, 11:32 a.m. UTC | #15
Hi,

On 2/18/16 21:18, Radim Krčmář wrote:
> 2016-02-18 09:25+0700, Suravee Suthikulpanit:
>> On 2/17/16 01:06, Radim Krčmář wrote:
>>> 2016-02-16 17:56+0100, Paolo Bonzini:
>>>>> On 16/02/2016 15:13, Radim Krčmář wrote:
>>>>>>> Yeah, I think atomic there means that it won't race with other writes to
>>>>>>> the same byte in IRR.  We're fine as long as AVIC writes IRR before
>>>>>>> checking IsRunning on every destination, which it seems to be.
>>>>>
>>>>> More precisely, if AVIC writes all IRRs (5.1) and ANDs all IsRunning
>>>>> flags before checking the result of the AND (6).
>>>>>
>>>>>>> (It would, but I believe that AVIC designers made it sane and the spec
>>>>>>>   doesn't let me read it in a way that supports your theories.)
>>>>>
>>>>> I hope so as well, and you've probably convinced me.  But I still think
>>>>> the code is wrong in this patch.  Let's look at the spec that you pasted:
>>> The code definitely is wrong.  I'll be more specific when disagreeing,
>>> sorry.
>>>
>>
>> Would you please be a bit more specific about what you think I am not doing
>> correctly to handle the #VMEXIT in the target-not-running case below?
>>
>> +    case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
>> +        kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
>> +        kvm_lapic_reg_write(apic, APIC_ICR, icrl);
>>
>> This is actually not just writing to the register. Please note that writing
>> to the APIC_ICR register also ends up calling apic_send_ipi(), which results
>> in injecting interrupts into the target core:
>
> Exactly.  Injecting the interrupt in AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN
> handler is causing the double-injection bug that Paolo described.
>
>> Am I missing something?
>
> Probably that AVIC already wrote to all IRRs (and sent appropriate
> doorbells) before this VMEXIT, so KVM shouldn't repeat it.

Ah, OK, I got it now. Thanks for the detailed description. I am still
waiting to hear back from the hardware designer to confirm the HW
behavior. Meanwhile, I have tried NOT setting the IRR and only calling
kick_vcpu(), and things seem to work fine. Therefore, I think your
analysis is likely correct.
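
The kick-only handling described above can be sketched as a toy model (stand-in types, not the kvm structures; the point is only what the handler does and does not do):

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu {
	bool is_running;
	bool irr_pending;   /* already latched by AVIC before the #VMEXIT */
	bool kicked;
};

static void kvm_vcpu_kick(struct vcpu *v)
{
	v->kicked = true;
}

/* On AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN, hardware has already written the
 * vector into every destination's IRR and rung doorbells for the running
 * ones, so the handler only wakes the destinations that were not running. */
static void handle_target_not_run(struct vcpu *dest, int ndest)
{
	for (int i = 0; i < ndest; i++)
		if (!dest[i].is_running)   /* running vcpus saw the doorbell */
			kvm_vcpu_kick(&dest[i]);
	/* No kvm_lapic_reg_write(APIC_ICR) here: that would inject the
	 * interrupt a second time after the guest EOIs the first one. */
}
```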

Thanks again,
Suravee
Suravee Suthikulpanit Feb. 19, 2016, 11:39 a.m. UTC | #16
Hi,

On 2/19/16 00:18, Paolo Bonzini wrote:
>
>
> On 18/02/2016 17:27, Radim Krčmář wrote:
>> 2016-02-18 16:53+0100, Paolo Bonzini:
>>> Patch 9 is okay, but it is also necessary to clear IsRunning in
>>> kvm_arch_vcpu_blocking and set it in kvm_arch_vcpu_unblocking.  In
>>> addition, vcpu_put/vcpu_load should not modify IsRunning between
>>> kvm_arch_vcpu_blocking and kvm_arch_vcpu_unblocking.  Do you agree?
>>
>> Yes.
>>
>> I think we don't need to clear IsRunning on preemption, which would
>> simplify the protection.  (I haven't thought much about userspace exit,
>> so maybe we could skip that one as well, but we don't need to now.)
>>
>> The reason for that is that KVM knows that the VCPU was scheduled out,
>> so it couldn't do much in the AVIC VMEXIT.
>> (KVM could force the scheduler to prioritize the VCPU, but our kick
>>   doesn't do that now and it seems like a bad idea.)
>>
>> Does it seem reasonable?
>
> Yes, and in fact it wouldn't need to clear and set IsRunning on
> vcpu_put/vcpu_load; only on vcpu_blocking/vcpu_unblocking.
>
> The IsRunning flag is more of an IsNotHalted flag, in the end.
>
> Paolo
>

Good point. I have made the change by introducing new function pointers,
kvm_x86_ops.vcpu_blocking() and kvm_x86_ops.vcpu_unblocking(), which
provide the hooks to set/unset the IsRunning bit. Also, I no longer
set the bit in vcpu_load/vcpu_put.

If this is okay, I'll send out V2 soon.

Thanks,
Suravee
Paolo Bonzini Feb. 19, 2016, 11:44 a.m. UTC | #17
On 19/02/2016 12:39, Suravee Suthikulpanit wrote:
> 
> Good point. I have made the change by introducing new function pointer,
> kvm_x86_ops.vcpu_blocking() and kvm_x86_ops.vcpu_unblocking(). Then
> provides the hook to set/unset the IsRunningBit here. Also, I no longer
> set the bit in the vcpu_load/vcpu_put.
> 
> If this is okay. I'll send out V2 soon.

Great, thanks.  Have you tried making the backing page point to the
kvm_lapic's regs?  If so, I think you're ready to send out v2 indeed.

Paolo
Suravee Suthikulpanit Feb. 19, 2016, 11:59 a.m. UTC | #18
Hi

On 2/19/16 18:44, Paolo Bonzini wrote:
>
>
> On 19/02/2016 12:39, Suravee Suthikulpanit wrote:
>>
>> Good point. I have made the change by introducing new function pointer,
>> kvm_x86_ops.vcpu_blocking() and kvm_x86_ops.vcpu_unblocking(). Then
>> provides the hook to set/unset the IsRunningBit here. Also, I no longer
>> set the bit in the vcpu_load/vcpu_put.
>>
>> If this is okay. I'll send out V2 soon.
>
> Great, thanks.  Have you tried making the backing page point to the
> kvm_lapic's regs?  If so, I think you're ready to send out v2 indeed.
>
> Paolo
>

Yes, this is also done.  It works quite well. I was just about to reply to 
you regarding this change. (I was having trouble finding the related 
message in the thread :) ).
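
The backing-page change being confirmed here can be pictured with a trivial sketch (the structures are stand-ins, not the real kvm_lapic/VMCB layouts): when the AVIC backing page *is* the lapic's regs page, state written through either path is visible to the other with no copying or syncing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define APIC_PAGE_WORDS (4096 / sizeof(uint32_t))
#define APIC_TASKPRI    (0x80 / 4)   /* word index of the TPR, for example */

struct kvm_lapic { uint32_t *regs; };          /* stand-in */
struct vmcb_ctl  { uint32_t *avic_backing; };  /* stand-in */

/* Point the AVIC backing page at the lapic's register page instead of
 * allocating a second page that would have to be kept in sync. */
static void avic_init_backing_page(struct vmcb_ctl *ctl, struct kvm_lapic *apic)
{
	ctl->avic_backing = apic->regs;
}
```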

Thanks,
Suravee
Suravee Suthikulpanit March 3, 2016, 10:42 a.m. UTC | #19
Hi

On 02/19/2016 12:18 AM, Paolo Bonzini wrote:
>
>
> On 18/02/2016 17:27, Radim Krčmář wrote:
>> 2016-02-18 16:53+0100, Paolo Bonzini:
>>> Patch 9 is okay, but it is also necessary to clear IsRunning in
>>> kvm_arch_vcpu_blocking and set it in kvm_arch_vcpu_unblocking.  In
>>> addition, vcpu_put/vcpu_load should not modify IsRunning between
>>> kvm_arch_vcpu_blocking and kvm_arch_vcpu_unblocking.  Do you agree?
>>
>> Yes.
>>
>> I think we don't need to clear IsRunning on preemption, which would
>> simplify the protection.  (I haven't thought much about userspace exit,
>> so maybe we could skip that one as well, but we don't need to now.)
>>
>> The reason for that is that KVM knows that the VCPU was scheduled out,
>> so it couldn't do much in the AVIC VMEXIT.
>> (KVM could force the scheduler to prioritize the VCPU, but our kick
>>   doesn't do that now and it seems like a bad idea.)
>>
>> Does it seem reasonable?
>
> Yes, and in fact it wouldn't need to clear and set IsRunning on
> vcpu_put/vcpu_load; only on vcpu_blocking/vcpu_unblocking.
>
> The IsRunning flag is more of an IsNotHalted flag, in the end.
>
> Paolo
>

In fact, instead of setting up the vAPIC backing page address when 
calling kvm_arch_vcpu_load(), we should be able to do it when calling 
kvm_arch_vcpu_sched_in(). This seems more appropriate since 
kvm_arch_vcpu_load() is also called on many unnecessary occasions via 
vcpu_load() (in arch/x86/kvm/x86.c). The same goes for 
kvm_arch_vcpu_put().

However, there is no kvm_arch_vcpu_sched_out(), but one could be added 
easily.

What do you think?

Suravee



Paolo Bonzini March 3, 2016, 10:50 a.m. UTC | #20
On 03/03/2016 11:42, Suravee Suthikulpanit wrote:
> In facts, instead of setting up the vAPIC backing page address when
> calling kvm_arch_vcpu_load(), we should be able to do it when calling
> kvm_arch_vcpu_sched_in(). This seems more appropriate since the
> kvm_arch_vcpu_load() is also called in many unnecessary occasions via
> vcpu_load() (in the arch/x86/kvm/x86.c). The same goes for the
> kvm_arch_vcpu_put().
> 
> However, there is no kvm_arch_vcpu_sched_out(). But that can be added
> easily.
> 
> What do you think?

How much code is involved?  It seems like an unnecessary complication...
Unless you can profile a difference, I would leave it in
kvm_arch_vcpu_load()/kvm_arch_vcpu_put().

Note that sched_in and sched_out are not called once per invocation of
KVM_RUN---only through the preempt notifiers.  So I'm not sure it is
enough to use sched_in and sched_out.

Paolo
diff mbox

Patch

diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 8a4add8..ebfdf8d 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -73,6 +73,8 @@ 
 #define SVM_EXIT_MWAIT_COND    0x08c
 #define SVM_EXIT_XSETBV        0x08d
 #define SVM_EXIT_NPF           0x400
+#define SVM_EXIT_AVIC_INCMP_IPI 0x401
+#define SVM_EXIT_AVIC_NOACCEL  0x402
 
 #define SVM_EXIT_ERR           -1
 
@@ -107,8 +109,10 @@ 
 	{ SVM_EXIT_SMI,         "smi" }, \
 	{ SVM_EXIT_INIT,        "init" }, \
 	{ SVM_EXIT_VINTR,       "vintr" }, \
	{ SVM_EXIT_CR0_SEL_WRITE, "cr0_sel_write" }, \
 	{ SVM_EXIT_CPUID,       "cpuid" }, \
 	{ SVM_EXIT_INVD,        "invd" }, \
+	{ SVM_EXIT_PAUSE,       "pause" }, \
 	{ SVM_EXIT_HLT,         "hlt" }, \
 	{ SVM_EXIT_INVLPG,      "invlpg" }, \
 	{ SVM_EXIT_INVLPGA,     "invlpga" }, \
@@ -127,7 +131,10 @@ 
 	{ SVM_EXIT_MONITOR,     "monitor" }, \
 	{ SVM_EXIT_MWAIT,       "mwait" }, \
 	{ SVM_EXIT_XSETBV,      "xsetbv" }, \
-	{ SVM_EXIT_NPF,         "npf" }
+	{ SVM_EXIT_NPF,         "npf" }, \
+	{ SVM_EXIT_RSM,         "rsm" }, \
+	{ SVM_EXIT_AVIC_INCMP_IPI, "avic_incomp_ipi" }, \
+	{ SVM_EXIT_AVIC_NOACCEL,   "avic_noaccel" }
 
 
 #endif /* _UAPI__SVM_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 9440b48..bedf52b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3667,6 +3667,245 @@  static int mwait_interception(struct vcpu_svm *svm)
 	return nop_interception(svm);
 }
 
+enum avic_incmp_ipi_err_code {
+	AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE,
+	AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN,
+	AVIC_INCMP_IPI_ERR_INV_TARGET,
+	AVIC_INCMP_IPI_ERR_INV_BK_PAGE,
+};
+
+static int avic_incomp_ipi_interception(struct vcpu_svm *svm)
+{
+	u32 icrh = svm->vmcb->control.exit_info_1 >> 32;
+	u32 icrl = svm->vmcb->control.exit_info_1;
+	u32 id = svm->vmcb->control.exit_info_2 >> 32;
+	u32 index = svm->vmcb->control.exit_info_2 & 0xFF;
+	struct kvm_lapic *apic = svm->vcpu.arch.apic;
+
+	pr_debug("SVM: %s: cpu=%#x, vcpu=%#x, "
+		 "icrh:icrl=%#010x:%08x, id=%u, index=%u\n",
+		 __func__, svm->vcpu.cpu, svm->vcpu.vcpu_id,
+		 icrh, icrl, id, index);
+
+	switch (id) {
+	case AVIC_INCMP_IPI_ERR_INVALID_INT_TYPE:
+		/*
+		 * AVIC hardware handles the generation of
+		 * IPIs when the specified Message Type is Fixed
+		 * (also known as fixed delivery mode) and
+		 * the Trigger Mode is edge-triggered. The hardware
+		 * also supports self and broadcast delivery modes
+		 * specified via the Destination Shorthand(DSH)
+		 * field of the ICRL. Logical and physical APIC ID
+		 * formats are supported. All other IPI types cause
+		 * a #VMEXIT, which needs to be emulated.
+		 */
+		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
+		kvm_lapic_reg_write(apic, APIC_ICR, icrl);
+		break;
+	case AVIC_INCMP_IPI_ERR_TARGET_NOT_RUN:
+		kvm_lapic_reg_write(apic, APIC_ICR2, icrh);
+		kvm_lapic_reg_write(apic, APIC_ICR, icrl);
+		break;
+	case AVIC_INCMP_IPI_ERR_INV_TARGET:
+		pr_err("SVM: %s: Invalid IPI target (icr=%#08x:%08x, idx=%u)\n",
+		       __func__, icrh, icrl, index);
+		BUG();
+		break;
+	case AVIC_INCMP_IPI_ERR_INV_BK_PAGE:
+		pr_err("SVM: %s: Invalid bk page (icr=%#08x:%08x, idx=%u)\n",
+		       __func__, icrh, icrl, index);
+		BUG();
+		break;
+	default:
+		pr_err("SVM: Unknown IPI interception\n");
+	}
+
+	return 1;
+}
+
+static int avic_noaccel_trap_write(struct vcpu_svm *svm)
+{
+	u32 offset = svm->vmcb->control.exit_info_1 & 0xFF0;
+	struct svm_vm_data *vm_data = svm->vcpu.kvm->arch.arch_data;
+	struct kvm_lapic *apic = svm->vcpu.arch.apic;
+	u32 reg = *avic_get_bk_page_entry(svm, offset);
+
+	pr_debug("SVM: %s: offset=%#x, val=%#x, (cpu=%x) (vcpu_id=%x)\n",
+		 __func__, offset, reg, svm->vcpu.cpu, svm->vcpu.vcpu_id);
+
+	switch (offset) {
+	case APIC_ID: {
+		u32 aid = (reg >> 24) & 0xff;
+		struct svm_avic_phy_ait_entry *o_ent =
+				avic_get_phy_ait_entry(&svm->vcpu, svm->vcpu.vcpu_id);
+		struct svm_avic_phy_ait_entry *n_ent =
+				avic_get_phy_ait_entry(&svm->vcpu, aid);
+
+		if (!n_ent || !o_ent)
+			return 0;
+
+		pr_debug("SVM: %s: APIC_ID=%#x (id=%x)\n", __func__, reg, aid);
+
+		/* We need to move phy_apic_entry to new offset */
+		*n_ent = *o_ent;
+		*((u64 *)o_ent) = 0ULL;
+		break;
+	}
+	case APIC_LDR: {
+		int ret, lid;
+		int dlid = (reg >> 24) & 0xff;
+
+		if (!dlid)
+			return 0;
+
+		lid = ffs(dlid) - 1;
+		pr_debug("SVM: %s: LDR=%0#10x (lid=%x)\n", __func__, reg, lid);
+		ret = avic_init_log_apic_entry(&svm->vcpu, svm->vcpu.vcpu_id,
+					       lid);
+		if (ret)
+			return 0;
+
+		break;
+	}
+	case APIC_DFR: {
+		u32 mod = (*avic_get_bk_page_entry(svm, offset) >> 28) & 0xf;
+
+		pr_debug("SVM: %s: DFR=%#x (%s)\n", __func__,
+			 mod, mod == 0xf ? "flat" : "cluster");
+
+		/*
+		 * We assume that all local APICs are using the same type.
+		 * If this changes, we need to rebuild the AVIC logical
+		 * APID id table with subsequent write to APIC_LDR.
+		 */
+		if (vm_data->ldr_mode != mod) {
+			clear_page(page_address(vm_data->avic_log_ait_page));
+			vm_data->ldr_mode = mod;
+		}
+		break;
+	}
+	case APIC_TMICT: {
+		u32 val = kvm_apic_get_reg(apic, APIC_TMICT);
+
+		pr_debug("SVM: %s: TMICT=%#x,%#x\n", __func__, val, reg);
+		break;
+	}
+	case APIC_ESR: {
+		u32 val = kvm_apic_get_reg(apic, APIC_ESR);
+
+		pr_debug("SVM: %s: ESR=%#x,%#x\n", __func__, val, reg);
+		break;
+	}
+	case APIC_LVTERR: {
+		u32 val = kvm_apic_get_reg(apic, APIC_LVTERR);
+
+		pr_debug("SVM: %s: LVTERR=%#x,%#x\n", __func__, val, reg);
+		break;
+	}
+	default:
+		break;
+	}
+
+	kvm_lapic_reg_write(apic, offset, reg);
+
+	return 1;
+}
+
+static int avic_noaccel_fault_read(struct vcpu_svm *svm)
+{
+	u32 val;
+	u32 offset = svm->vmcb->control.exit_info_1 & 0xFF0;
+	struct kvm_lapic *apic = svm->vcpu.arch.apic;
+
+	pr_debug("SVM: %s: offset=%x\n", __func__, offset);
+
+	switch (offset) {
+	case APIC_TMCCT: {
+		if (kvm_lapic_reg_read(apic, offset, 4, &val))
+			return 0;
+
+		pr_debug("SVM: %s: TMCCT: rip=%#lx, next_rip=%#llx, val=%#x)\n",
+			 __func__, kvm_rip_read(&svm->vcpu), svm->next_rip, val);
+
+		*avic_get_bk_page_entry(svm, offset) = val;
+		break;
+	}
+	default:
+		pr_debug("SVM: %s: (rip=%#lx), offset=%#x\n", __func__,
+			 kvm_rip_read(&svm->vcpu), offset);
+		break;
+	}
+
+	return 1;
+}
+
+static int avic_noaccel_fault_write(struct vcpu_svm *svm)
+{
+	u32 offset = svm->vmcb->control.exit_info_1 & 0xFF0;
+
+	pr_debug("SVM: %s: offset=%x\n", __func__, offset);
+
+	switch (offset) {
+	case APIC_ARBPRI: /* APR: Arbitration Priority Register */
+	case APIC_TMCCT: /* Timer Current Count */
+		/* TODO */
+		break;
+	default:
+		BUG();
+	}
+
+	return 1;
+}
+
+static int avic_noaccel_interception(struct vcpu_svm *svm)
+{
+	int ret = 0;
+	u32 offset = svm->vmcb->control.exit_info_1 & 0xFF0;
+	u32 rw = (svm->vmcb->control.exit_info_1 >> 32) & 0x1;
+	u32 vector = svm->vmcb->control.exit_info_2 & 0xFFFFFFFF;
+
+	pr_debug("SVM: %s: offset=%#x, rw=%#x, vector=%#x, vcpu_id=%#x, cpu=%#x\n",
+		 __func__, offset, rw, vector, svm->vcpu.vcpu_id, svm->vcpu.cpu);
+
+	BUG_ON(offset >= 0x400);
+
+	switch (offset) {
+	case APIC_ID:
+	case APIC_EOI:
+	case APIC_RRR:
+	case APIC_LDR:
+	case APIC_DFR:
+	case APIC_SPIV:
+	case APIC_ESR:
+	case APIC_ICR:
+	case APIC_LVTT:
+	case APIC_LVTTHMR:
+	case APIC_LVTPC:
+	case APIC_LVT0:
+	case APIC_LVT1:
+	case APIC_LVTERR:
+	case APIC_TMICT:
+	case APIC_TDCR: {
+		/* Handling Trap */
+		if (!rw) /* Trap read should never happen */
+			BUG();
+		ret = avic_noaccel_trap_write(svm);
+		break;
+	}
+	default: {
+		/* Handling Fault */
+		if (rw)
+			ret = avic_noaccel_fault_write(svm);
+		else
+			ret = avic_noaccel_fault_read(svm);
+		skip_emulated_instruction(&svm->vcpu);
+	}
+	}
+
+	return ret;
+}
+
 static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -3730,6 +3969,8 @@  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_XSETBV]			= xsetbv_interception,
 	[SVM_EXIT_NPF]				= pf_interception,
 	[SVM_EXIT_RSM]                          = emulate_on_interception,
+	[SVM_EXIT_AVIC_INCMP_IPI]		= avic_incomp_ipi_interception,
+	[SVM_EXIT_AVIC_NOACCEL]			= avic_noaccel_interception,
 };
 
 static void dump_vmcb(struct kvm_vcpu *vcpu)