Message ID | 1511278211-12257-4-git-send-email-liran.alon@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
2017-11-21 17:30+0200, Liran Alon: > Do not consider pending exception when return injected exception > to user-mode. A "pending" exception means it's side-effect have not > been applied yet. In contrast, an "injected" exception means it's > side-effect have been already applied. > Therefore, we only need to report of injected exceptions to user-mode. > This is aligned with how interrupts are reported in same ioctl. Pending interrupts are stored in IRR, but we don't have anything for exceptions -- we would lose a trap exception that was made pending after handling inject_pending_event() if the VCPU got a userspace signal and save+restored before starting the next vcpu_enter_guest() cycle. (Non-trap exceptions should be generated again when re-executing, so losing them isn't that bad.) I think we should add state for pending exceptions in kvm_vcpu_events, like the FIXME says. Pending and injected are actually exclusive (for now?), so maybe we can use only one bit for that, thanks. An alternative, probably unattainable, would be to process the side-effects as we hit the exception. Using IRR to store pending interrupts also seems possible, but I'd expect more problems down the road.
On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: > Being symmetrical, injecting an exception from user-mode should > consider injected exception as in "injected" state and not in > "pending" state. I disagree with this contention. Suppose, for example, that we are executing in a nested context (i.e. vmcs02 is live) and usermode "injects" a page-fault. The page fault may be delivered on L2's IDT or it may cause an emulated VM-exit from L2 to L1, depending on the page fault error code, the exception bitmap in the vmcs12, and the page-fault error-code mask and match fields in the vmcs12. If the page fault is delivered on L2's IDT, then the exception can be considered as "injected," with the CR2 side effect already processed. However, if the page fault causes an emulated VM-exit from L2 to L1, then it must be considered as "pending" and the CR2 side effect must not have been processed. So, where do we find the new CR2 value? Admittedly, the existing KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side effects have already been performed for exceptions injected from userspace, though this assumption isn't documented AFAICT. Fortunately, there's enough padding in the kvm_vcpu_events structure to fix the API (with the possible exception of injected #VE*): one bit to indicate that userspace is providing the side effects in the events structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits (#DB). * One could argue that userspace should not be delivering a #VE directly, but should be injecting an EPT Violation instead.
On 22/11/17 22:25, Radim Krčmář wrote: > 2017-11-21 17:30+0200, Liran Alon: >> Do not consider pending exception when return injected exception >> to user-mode. A "pending" exception means it's side-effect have not >> been applied yet. In contrast, an "injected" exception means it's >> side-effect have been already applied. >> Therefore, we only need to report of injected exceptions to user-mode. >> This is aligned with how interrupts are reported in same ioctl. > > Pending interrupts are stored in IRR, but we don't have anything for > exceptions -- we would lose a trap exception that was made pending after > handling inject_pending_event() if the VCPU got a userspace signal and > save+restored before starting the next vcpu_enter_guest() cycle. > (Non-trap exceptions should be generated again when re-executing, so > losing them isn't that bad.) I see your point. I was indeed not considering correctly what will happen with trap exceptions. > > I think we should add state for pending exceptions in kvm_vcpu_events, > like the FIXME says. Pending and injected are actually exclusive (for > now?), so maybe we can use only one bit for that, The FIXME is a bit incorrect as it states "This is only needed for nested virtualization". As I understand this, it is not true as the issue relates to handling trap exceptions in general. But it is true that we should return extra state here. I also observed that pending & injected should be exclusive but didn't want to change code too much in one shot. You are right that in order to handle trap exceptions correctly we should also return to user-space if exception is in pending or injected state. I can change struct kvm_vcpu_events.exception such that: 1. "injected" field will remain logical OR of "exception.pending | .injected". 2. "pad" will become a "flags" field which currently contain a single flag indicating if exception is pending or not. That way, I think I don't break any backwards compatibility. What do you think? Regards, -Liran P.S: I would be glad if you could review also the rest of the patches in this series as they are independent of this change and are also a bit complex. Especially "[PATCH 7/8]: KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending". > > thanks. > > An alternative, probably unattainable, would be to process the > side-effects as we hit the exception. Using IRR to store pending > interrupts also seems possible, but I'd expect more problems down the > road. >
On 22/11/17 23:00, Jim Mattson wrote: > On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: > >> Being symmetrical, injecting an exception from user-mode should >> consider injected exception as in "injected" state and not in >> "pending" state. > > I disagree with this contention. Suppose, for example, that we are > executing in a nested context (i.e. vmcs02 is live) and usermode > "injects" a page-fault. The page fault may be delivered on L2's IDT or > it may cause an emulated VM-exit from L2 to L1, depending on the page > fault error code, the exception bitmap in the vmcs12, and the > page-fault error-code mask and match fields in the vmcs12. If the page > fault is delivered on L2's IDT, then the exception can be considered > as "injected," with the CR2 side effect already processed. However, if > the page fault causes an emulated VM-exit from L2 to L1, then it must > be considered as "pending" and the CR2 side effect must not have been > processed. I understand you are referring here to the FIXME comment that is present on nested_vmx_check_exception()? > > So, where do we find the new CR2 value? Admittedly, the existing > KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side > effects have already been performed for exceptions injected from > userspace, though this assumption isn't documented AFAICT. > Fortunately, there's enough padding in the kvm_vcpu_events structure > to fix the API (with the possible exception of injected #VE*): one bit > to indicate that userspace is providing the side effects in the events > structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits > (#DB). I see. Then what do you think about the following change: 1. Rename kvm_vcpu_events.exception.injected to kvm_vcpu_events.exception.raised and remain still to be the logical OR of "exception.pending | exception.injected" as of today (to not break backwards compatibility). 2. Add a new flag to kvm_vcpu_events.flags to indicate if kvm_vcpu_events.exception.raised is actually exception.pending or exception.injected (they are mutually exclusive). A value of 0 will be considered as "injected" to preserve backwards compatibility. 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for exception_extra_info which will be either CR2 for #PF or DR6 for #DB. 4. Add to kvm_queued_exception() a u64 exception_extra_info that will either be CR2 for #PF or DR6 for #DB. Make sure that these will be set on relevant places and filled to vcpu.arch.cr2/VMCS only on inject_pending_event() injection of a pending exception. I think that in the above proposal, we don't need to be conditional on a new capability because old user-space shouldn't be affected. (They will still get same value in kvm_vcpu_events.exception.raised and the rest were ignored fields). What do you think? Thanks for the excellent review! -Liran > > * One could argue that userspace should not be delivering a #VE > directly, but should be injecting an EPT Violation instead. >
On 23/11/2017 19:45, Liran Alon wrote: > 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for > exception_extra_info which will be either CR2 for #PF or DR6 for #DB. > 4. Add to kvm_queued_exception() a u64 exception_extra_info that will > either be CR2 for #PF or DR6 for #DB. Make sure that these will be set > on relevant places and filled to vcpu.arch.cr2/VMCS only on > inject_pending_event() injection of a pending exception. An aside: just to complicate things a bit, AMD overwrites the actual DR6 (and thus the "guest" DR6) even if you intercept the #DB vmexit. Paolo
I think we need some way for userspace to indicate whether or not it can deal with pending events (with side effects recorded in kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]) when it issues a KVM_GET_VCPU_EVENTS ioctl. That seems to argue for a new flag bit in kvm_vcpu_events.flags (as input to KVM_GET_VCPU_EVENTS) and a capability indicating that the flag can be set by userspace. On Thu, Nov 23, 2017 at 10:45 AM, Liran Alon <LIRAN.ALON@oracle.com> wrote: > > > On 22/11/17 23:00, Jim Mattson wrote: >> >> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >> >>> Being symmetrical, injecting an exception from user-mode should >>> consider injected exception as in "injected" state and not in >>> "pending" state. >> >> >> I disagree with this contention. Suppose, for example, that we are >> executing in a nested context (i.e. vmcs02 is live) and usermode >> "injects" a page-fault. The page fault may be delivered on L2's IDT or >> it may cause an emulated VM-exit from L2 to L1, depending on the page >> fault error code, the exception bitmap in the vmcs12, and the >> page-fault error-code mask and match fields in the vmcs12. If the page >> fault is delivered on L2's IDT, then the exception can be considered >> as "injected," with the CR2 side effect already processed. However, if >> the page fault causes an emulated VM-exit from L2 to L1, then it must >> be considered as "pending" and the CR2 side effect must not have been >> processed. > > I understand you are referring here to the FIXME comment that is present on > nested_vmx_check_exception()? >> >> >> So, where do we find the new CR2 value? Admittedly, the existing >> KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side >> effects have already been performed for exceptions injected from >> userspace, though this assumption isn't documented AFAICT. >> Fortunately, there's enough padding in the kvm_vcpu_events structure >> to fix the API (with the possible exception of injected #VE*): one bit >> to indicate that userspace is providing the side effects in the events >> structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits >> (#DB). > > I see. Then what do you think about the following change: > 1. Rename kvm_vcpu_events.exception.injected to > kvm_vcpu_events.exception.raised and remain still to be the logical OR of > "exception.pending | exception.injected" as of today (to not break backwards > compatibility). > 2. Add a new flag to kvm_vcpu_events.flags to indicate if > kvm_vcpu_events.exception.raised is actually exception.pending or > exception.injected (they are mutually exclusive). A value of 0 will be > considered as "injected" to preserve backwards compatibility. > 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for > exception_extra_info which will be either CR2 for #PF or DR6 for #DB. > 4. Add to kvm_queued_exception() a u64 exception_extra_info that will either > be CR2 for #PF or DR6 for #DB. Make sure that these will be set on relevant > places and filled to vcpu.arch.cr2/VMCS only on inject_pending_event() > injection of a pending exception. > > I think that in the above proposal, we don't need to be conditional on a new > capability because old user-space shouldn't be affected. > (They will still get same value in kvm_vcpu_events.exception.raised and the > rest were ignored fields). > > What do you think? > > Thanks for the excellent review! > -Liran > >> >> * One could argue that userspace should not be delivering a #VE >> directly, but should be injecting an EPT Violation instead. >> >
On 27/11/17 19:26, Jim Mattson wrote: > I think we need some way for userspace to indicate whether or not it > can deal with pending events (with side effects recorded in > kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1]) when it > issues a KVM_GET_VCPU_EVENTS ioctl. That seems to argue for a new flag > bit in kvm_vcpu_events.flags (as input to KVM_GET_VCPU_EVENTS) and a > capability indicating that the flag can be set by userspace. Yep I agree. I will do the relevant changes for the next version of this series. Thanks, -Liran > > On Thu, Nov 23, 2017 at 10:45 AM, Liran Alon <LIRAN.ALON@oracle.com> wrote: >> >> >> On 22/11/17 23:00, Jim Mattson wrote: >>> >>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote: >>> >>>> Being symmetrical, injecting an exception from user-mode should >>>> consider injected exception as in "injected" state and not in >>>> "pending" state. >>> >>> >>> I disagree with this contention. Suppose, for example, that we are >>> executing in a nested context (i.e. vmcs02 is live) and usermode >>> "injects" a page-fault. The page fault may be delivered on L2's IDT or >>> it may cause an emulated VM-exit from L2 to L1, depending on the page >>> fault error code, the exception bitmap in the vmcs12, and the >>> page-fault error-code mask and match fields in the vmcs12. If the page >>> fault is delivered on L2's IDT, then the exception can be considered >>> as "injected," with the CR2 side effect already processed. However, if >>> the page fault causes an emulated VM-exit from L2 to L1, then it must >>> be considered as "pending" and the CR2 side effect must not have been >>> processed. >> >> I understand you are referring here to the FIXME comment that is present on >> nested_vmx_check_exception()? >>> >>> >>> So, where do we find the new CR2 value? Admittedly, the existing >>> KVM_SET_VCPU_EVENTS API is broken and appears to assume that the side >>> effects have already been performed for exceptions injected from >>> userspace, though this assumption isn't documented AFAICT. >>> Fortunately, there's enough padding in the kvm_vcpu_events structure >>> to fix the API (with the possible exception of injected #VE*): one bit >>> to indicate that userspace is providing the side effects in the events >>> structure, and 64 bits for the new CR2 value (#PF) or the new DR6 bits >>> (#DB). >> >> I see. Then what do you think about the following change: >> 1. Rename kvm_vcpu_events.exception.injected to >> kvm_vcpu_events.exception.raised and remain still to be the logical OR of >> "exception.pending | exception.injected" as of today (to not break backwards >> compatibility). >> 2. Add a new flag to kvm_vcpu_events.flags to indicate if >> kvm_vcpu_events.exception.raised is actually exception.pending or >> exception.injected (they are mutually exclusive). A value of 0 will be >> considered as "injected" to preserve backwards compatibility. >> 3. Use kvm_vcpu_events.reserved[0] and kvm_vcpu_events.reserved[1] for >> exception_extra_info which will be either CR2 for #PF or DR6 for #DB. >> 4. Add to kvm_queued_exception() a u64 exception_extra_info that will either >> be CR2 for #PF or DR6 for #DB. Make sure that these will be set on relevant >> places and filled to vcpu.arch.cr2/VMCS only on inject_pending_event() >> injection of a pending exception. >> >> I think that in the above proposal, we don't need to be conditional on a new >> capability because old user-space shouldn't be affected. >> (They will still get same value in kvm_vcpu_events.exception.raised and the >> rest were ignored fields). >> >> What do you think? >> >> Thanks for the excellent review! >> -Liran >> >>> >>> * One could argue that userspace should not be delivering a #VE >>> directly, but should be injecting an EPT Violation instead. >>> >>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 45baba8bc02e..1490da89de4b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3096,14 +3096,9 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu, struct kvm_vcpu_events *events) { process_nmi(vcpu); - /* - * FIXME: pass injected and pending separately. This is only - * needed for nested virtualization, whose state cannot be - * migrated yet. For now we can combine them. - */ + events->exception.injected = - (vcpu->arch.exception.pending || - vcpu->arch.exception.injected) && + vcpu->arch.exception.injected && !kvm_exception_is_soft(vcpu->arch.exception.nr); events->exception.nr = vcpu->arch.exception.nr; events->exception.has_error_code = vcpu->arch.exception.has_error_code; @@ -3158,8 +3153,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu, return -EINVAL; process_nmi(vcpu); - vcpu->arch.exception.injected = false; - vcpu->arch.exception.pending = events->exception.injected; + vcpu->arch.exception.injected = events->exception.injected; + vcpu->arch.exception.pending = false; vcpu->arch.exception.nr = events->exception.nr; vcpu->arch.exception.has_error_code = events->exception.has_error_code; vcpu->arch.exception.error_code = events->exception.error_code;