diff mbox

[v2,3/8] KVM: x86: set/get_events ioctl should consider only injected exceptions

Message ID 1511278211-12257-4-git-send-email-liran.alon@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Liran Alon Nov. 21, 2017, 3:30 p.m. UTC
Do not consider pending exception when return injected exception
to user-mode. A "pending" exception means it's side-effect have not
been applied yet. In contrast, an "injected" exception means it's
side-effect have been already applied.
Therefore, we only need to report of injected exceptions to user-mode.
This is aligned with how interrupts are reported in same ioctl.

Being symmetrical, injecting an exception from user-mode should
consider injected exception as in "injected" state and not in
"pending" state.

Fixes: 664f8e26b00c ("KVM: X86: Fix loss of exception which has not
yet been injected")

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
---
 arch/x86/kvm/x86.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

Comments

Radim Krčmář Nov. 22, 2017, 8:25 p.m. UTC | #1
2017-11-21 17:30+0200, Liran Alon:
> Do not consider pending exception when return injected exception
> to user-mode. A "pending" exception means it's side-effect have not
> been applied yet. In contrast, an "injected" exception means it's
> side-effect have been already applied.
> Therefore, we only need to report of injected exceptions to user-mode.
> This is aligned with how interrupts are reported in same ioctl.

Pending interrupts are stored in IRR, but we don't have anything for
exceptions -- we would lose a trap exception that was made pending after
handling inject_pending_event() if the VCPU got a userspace signal and
save+restored before starting the next vcpu_enter_guest() cycle.
(Non-trap exceptions should be generated again when re-executing, so
 losing them isn't that bad.)

I think we should add state for pending exceptions in kvm_vcpu_events,
like the FIXME says.  Pending and injected are actually exclusive (for
now?), so maybe we can use only one bit for that,

thanks.

An alternative, probably unattainable, would be to process the
side-effects as we hit the exception.  Using IRR to store pending
interrupts also seems possible, but I'd expect more problems down the
road.
Jim Mattson Nov. 22, 2017, 9 p.m. UTC | #2
On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:

> Being symmetrical, injecting an exception from user-mode should
> consider injected exception as in "injected" state and not in
> "pending" state.

I disagree with this contention. Suppose, for example, that we are
executing in a nested context (i.e. vmcs02 is live) and usermode
"injects" a page-fault. The page fault may be delivered on L2's IDT or
it may cause an emulated VM-exit from L2 to L1, depending on the page
fault error code, the exception bitmap in the vmcs12, and the
page-fault error-code mask and match fields in the vmcs12. If the page
fault is delivered on L2's IDT, then the exception can be considered
as "injected," with the CR2 side effect already processed. However, if
the page fault causes an emulated VM-exit from L2 to L1, then it must
be considered as "pending" and the CR2 side effect must not have been
processed.

So, where do we find the new CR2 value? Admittedly, the existing
KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side
effects have already been performed for exceptions injected from
userspace, though this assumption isn't documented AFAICT.
Fortunately, there's enough padding in the kvm_vcpu_events structure
to fix the API (with the possible exception of injected #VE*): one bit
to indicate that userspace is providing the side effects in the events
structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits
(#DB).

* One could argue that userspace should not be delivering a #VE
directly, but should be injecting an EPT Violation instead.
Liran Alon Nov. 23, 2017, 6:20 p.m. UTC | #3
On 22/11/17 22:25, Radim Krčmář wrote:
> 2017-11-21 17:30+0200, Liran Alon:
>> Do not consider pending exception when return injected exception
>> to user-mode. A "pending" exception means it's side-effect have not
>> been applied yet. In contrast, an "injected" exception means it's
>> side-effect have been already applied.
>> Therefore, we only need to report of injected exceptions to user-mode.
>> This is aligned with how interrupts are reported in same ioctl.
>
> Pending interrupts are stored in IRR, but we don't have anything for
> exceptions -- we would lose a trap exception that was made pending after
> handling inject_pending_event() if the VCPU got a userspace signal and
> save+restored before starting the next vcpu_enter_guest() cycle.
> (Non-trap exceptions should be generated again when re-executing, so
>   losing them isn't that bad.)
I see your point. I was indeed not considering correctly what will 
happen with trap exceptions.
>
> I think we should add state for pending exceptions in kvm_vcpu_events,
> like the FIXME says.  Pending and injected are actually exclusive (for
> now?), so maybe we can use only one bit for that,
The FIXME is a bit incorrect as it states "This is only needed for 
nested virtualization". As I understand this, it is not true as the 
issue relates to handling trap exceptions in general. But it is true 
that we should return extra state here.

I also observed that pending & injected should be exclusive but didn't 
want to change code too much in one shot.

You are right that in order to handle trap exceptions correctly we 
should also return to user-space if exception is in pending or injected 
state.

I can change struct kvm_vcpu_events.exception such that:
1. "injected" field will remain logical OR of "exception.pending | 
.injected".
2. "pad" will become a "flags" field which currently contain a single 
flag indicating if exception is pending or not.

That way, I think I don't break any backwards compatibility.
What do you think?

Regards,
-Liran

P.S:
I would be glad if you could review also the rest of the patches in this 
series as they are independent of this change and are also a bit 
complex. Especially "[PATCH 7/8]: KVM: nVMX: Require immediate-exit when 
event reinjected to L2 and L1 event pending".
>
> thanks.
>
> An alternative, probably unattainable, would be to process the
> side-effects as we hit the exception.  Using IRR to store pending
> interrupts also seems possible, but I'd expect more problems down the
> road.
>
Liran Alon Nov. 23, 2017, 6:45 p.m. UTC | #4
On 22/11/17 23:00, Jim Mattson wrote:
> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>
>> Being symmetrical, injecting an exception from user-mode should
>> consider injected exception as in "injected" state and not in
>> "pending" state.
>
> I disagree with this contention. Suppose, for example, that we are
> executing in a nested context (i.e. vmcs02 is live) and usermode
> "injects" a page-fault. The page fault may be delivered on L2's IDT or
> it may cause an emulated VM-exit from L2 to L1, depending on the page
> fault error code, the exception bitmap in the vmcs12, and the
> page-fault error-code mask and match fields in the vmcs12. If the page
> fault is delivered on L2's IDT, then the exception can be considered
> as "injected," with the CR2 side effect already processed. However, if
> the page fault causes an emulated VM-exit from L2 to L1, then it must
> be considered as "pending" and the CR2 side effect must not have been
> processed.
I understand you are referring here to the FIXME comment that is present 
on nested_vmx_check_exception()?
>
> So, where do we find the new CR2 value? Admittedly, the existing
> KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side
> effects have already been performed for exceptions injected from
> userspace, though this assumption isn't documented AFAICT.
> Fortunately, there's enough padding in the kvm_vcpu_events structure
> to fix the API (with the possible exception of injected #VE*): one bit
> to indicate that userspace is providing the side effects in the events
> structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits
> (#DB).
I see. Then what do you think about the following change:
1. Rename kvm_vcpu_events.exception.injected to 
kvm_vcpu_events.exception.raised and remain still to be the logical OR 
of "exception.pending | exception.injected" as of today (to not break 
backwards compatibility).
2. Add a new flag to kvm_vcpu_events.flags to indicate if 
kvm_vcpu_events.exception.raised is actually exception.pending or 
exception.injected (they are mutually exclusive). A value of 0 will be 
considered as "injected" to preserve backwards compatibility.
3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for 
exception_extra_info which will be either CR2 for #PF or DR6 for #DB.
4. Add to kvm_queued_exception() a u64 exception_extra_info that will 
either be CR2 for #PF or DR6 for #DB. Make sure that these will be set 
on relevant places and filled to vcpu.arch.cr2/VMCS only on 
inject_pending_event() injection of a pending exception.

I think that in the above proposal, we don't need to be conditional on a 
new capability because old user-space shouldn't be affected.
(They will still get same value in kvm_vcpu_events.exception.raised and 
the rest were ignored fields).

What do you think?

Thanks for the excellent review!
-Liran
>
> * One could argue that userspace should not be delivering a #VE
> directly, but should be injecting an EPT Violation instead.
>
Paolo Bonzini Nov. 23, 2017, 11:05 p.m. UTC | #5
On 23/11/2017 19:45, Liran Alon wrote:
> 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for
> exception_extra_info which will be either CR2 for #PF or DR6 for #DB.
> 4. Add to kvm_queued_exception() a u64 exception_extra_info that will
> either be CR2 for #PF or DR6 for #DB. Make sure that these will be set
> on relevant places and filled to vcpu.arch.cr2/VMCS only on
> inject_pending_event() injection of a pending exception.

An aside: just to complicate things a bit, AMD overwrites the actual DR6
(and thus the "guest" DR6) even if you intercept the #DB vmexit.

Paolo
Jim Mattson Nov. 27, 2017, 5:26 p.m. UTC | #6
I think we need some way for userspace to indicate whether or not it
can deal with pending events (with side effects recorded in
kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]) when it
issues a KVM_GET_VCPU_EVENTS ioctl. That seems to argue for a new flag
bit in kvm_vcpu_events.flags (as input to KVM_GET_VCPU_EVENTS) and a
capability indicating that the flag can be set by userspace.

On Thu, Nov 23, 2017 at 10:45 AM, Liran Alon <LIRAN.ALON@oracle.com> wrote:
>
>
> On 22/11/17 23:00, Jim Mattson wrote:
>>
>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>>
>>> Being symmetrical, injecting an exception from user-mode should
>>> consider injected exception as in "injected" state and not in
>>> "pending" state.
>>
>>
>> I disagree with this contention. Suppose, for example, that we are
>> executing in a nested context (i.e. vmcs02 is live) and usermode
>> "injects" a page-fault. The page fault may be delivered on L2's IDT or
>> it may cause an emulated VM-exit from L2 to L1, depending on the page
>> fault error code, the exception bitmap in the vmcs12, and the
>> page-fault error-code mask and match fields in the vmcs12. If the page
>> fault is delivered on L2's IDT, then the exception can be considered
>> as "injected," with the CR2 side effect already processed. However, if
>> the page fault causes an emulated VM-exit from L2 to L1, then it must
>> be considered as "pending" and the CR2 side effect must not have been
>> processed.
>
> I understand you are referring here to the FIXME comment that is present on
> nested_vmx_check_exception()?
>>
>>
>> So, where do we find the new CR2 value? Admittedly, the existing
>> KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side
>> effects have already been performed for exceptions injected from
>> userspace, though this assumption isn't documented AFAICT.
>> Fortunately, there's enough padding in the kvm_vcpu_events structure
>> to fix the API (with the possible exception of injected #VE*): one bit
>> to indicate that userspace is providing the side effects in the events
>> structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits
>> (#DB).
>
> I see. Then what do you think about the following change:
> 1. Rename kvm_vcpu_events.exception.injected to
> kvm_vcpu_events.exception.raised and remain still to be the logical OR of
> "exception.pending | exception.injected" as of today (to not break backwards
> compatibility).
> 2. Add a new flag to kvm_vcpu_events.flags to indicate if
> kvm_vcpu_events.exception.raised is actually exception.pending or
> exception.injected (they are mutually exclusive). A value of 0 will be
> considered as "injected" to preserve backwards compatibility.
> 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for
> exception_extra_info which will be either CR2 for #PF or DR6 for #DB.
> 4. Add to kvm_queued_exception() a u64 exception_extra_info that will either
> be CR2 for #PF or DR6 for #DB. Make sure that these will be set on relevant
> places and filled to vcpu.arch.cr2/VMCS only on inject_pending_event()
> injection of a pending exception.
>
> I think that in the above proposal, we don't need to be conditional on a new
> capability because old user-space shouldn't be affected.
> (They will still get same value in kvm_vcpu_events.exception.raised and the
> rest were ignored fields).
>
> What do you think?
>
> Thanks for the excellent review!
> -Liran
>
>>
>> * One could argue that userspace should not be delivering a #VE
>> directly, but should be injecting an EPT Violation instead.
>>
>
Liran Alon Nov. 27, 2017, 6:30 p.m. UTC | #7
On 27/11/17 19:26, Jim Mattson wrote:
> I think we need some way for userspace to indicate whether or not it
> can deal with pending events (with side effects recorded in
> kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]) when it
> issues a KVM_GET_VCPU_EVENTS ioctl. That seems to argue for a new flag
> bit in kvm_vcpu_events.flags (as input to KVM_GET_VCPU_EVENTS) and a
> capability indicating that the flag can be set by userspace.
Yep I agree. I will do the relevant changes for the next version of this 
series.

Thanks,
-Liran
>
> On Thu, Nov 23, 2017 at 10:45 AM, Liran Alon <LIRAN.ALON@oracle.com> wrote:
>>
>>
>> On 22/11/17 23:00, Jim Mattson wrote:
>>>
>>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>>>
>>>> Being symmetrical, injecting an exception from user-mode should
>>>> consider injected exception as in "injected" state and not in
>>>> "pending" state.
>>>
>>>
>>> I disagree with this contention. Suppose, for example, that we are
>>> executing in a nested context (i.e. vmcs02 is live) and usermode
>>> "injects" a page-fault. The page fault may be delivered on L2's IDT or
>>> it may cause an emulated VM-exit from L2 to L1, depending on the page
>>> fault error code, the exception bitmap in the vmcs12, and the
>>> page-fault error-code mask and match fields in the vmcs12. If the page
>>> fault is delivered on L2's IDT, then the exception can be considered
>>> as "injected," with the CR2 side effect already processed. However, if
>>> the page fault causes an emulated VM-exit from L2 to L1, then it must
>>> be considered as "pending" and the CR2 side effect must not have been
>>> processed.
>>
>> I understand you are referring here to the FIXME comment that is present on
>> nested_vmx_check_exception()?
>>>
>>>
>>> So, where do we find the new CR2 value? Admittedly, the existing
>>> KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side
>>> effects have already been performed for exceptions injected from
>>> userspace, though this assumption isn't documented AFAICT.
>>> Fortunately, there's enough padding in the kvm_vcpu_events structure
>>> to fix the API (with the possible exception of injected #VE*): one bit
>>> to indicate that userspace is providing the side effects in the events
>>> structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits
>>> (#DB).
>>
>> I see. Then what do you think about the following change:
>> 1. Rename kvm_vcpu_events.exception.injected to
>> kvm_vcpu_events.exception.raised and remain still to be the logical OR of
>> "exception.pending | exception.injected" as of today (to not break backwards
>> compatibility).
>> 2. Add a new flag to kvm_vcpu_events.flags to indicate if
>> kvm_vcpu_events.exception.raised is actually exception.pending or
>> exception.injected (they are mutually exclusive). A value of 0 will be
>> considered as "injected" to preserve backwards compatibility.
>> 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for
>> exception_extra_info which will be either CR2 for #PF or DR6 for #DB.
>> 4. Add to kvm_queued_exception() a u64 exception_extra_info that will either
>> be CR2 for #PF or DR6 for #DB. Make sure that these will be set on relevant
>> places and filled to vcpu.arch.cr2/VMCS only on inject_pending_event()
>> injection of a pending exception.
>>
>> I think that in the above proposal, we don't need to be conditional on a new
>> capability because old user-space shouldn't be affected.
>> (They will still get same value in kvm_vcpu_events.exception.raised and the
>> rest were ignored fields).
>>
>> What do you think?
>>
>> Thanks for the excellent review!
>> -Liran
>>
>>>
>>> * One could argue that userspace should not be delivering a #VE
>>> directly, but should be injecting an EPT Violation instead.
>>>
>>
diff mbox

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 45baba8bc02e..1490da89de4b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3096,14 +3096,9 @@  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 					       struct kvm_vcpu_events *events)
 {
 	process_nmi(vcpu);
-	/*
-	 * FIXME: pass injected and pending separately.  This is only
-	 * needed for nested virtualization, whose state cannot be
-	 * migrated yet.  For now we can combine them.
-	 */
+
 	events->exception.injected =
-		(vcpu->arch.exception.pending ||
-		 vcpu->arch.exception.injected) &&
+		vcpu->arch.exception.injected &&
 		!kvm_exception_is_soft(vcpu->arch.exception.nr);
 	events->exception.nr = vcpu->arch.exception.nr;
 	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
@@ -3158,8 +3153,8 @@  static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	process_nmi(vcpu);
-	vcpu->arch.exception.injected = false;
-	vcpu->arch.exception.pending = events->exception.injected;
+	vcpu->arch.exception.injected = events->exception.injected;
+	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.nr = events->exception.nr;
 	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
 	vcpu->arch.exception.error_code = events->exception.error_code;