Message ID | 20210628124814.1001507-1-stsp2@yandex.ru (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] KVM: X86: Fix exception untrigger on ret to user | expand |
On Mon, 2021-06-28 at 15:48 +0300, Stas Sergeev wrote: > When returning to user, the special care is taken about the > exception that was already injected to VMCS but not yet to guest. > cancel_injection removes such exception from VMCS. It is set as > pending, and if the user does KVM_SET_REGS, it gets completely canceled. > > This didn't happen though, because the vcpu->arch.exception.injected > and vcpu->arch.exception.pending were forgotten to update in > cancel_injection. As the result, KVM_SET_REGS didn't cancel out > anything, and the exception was re-injected on the next KVM_RUN, > even though the guest registers (like EIP) were already modified. > This was leading to an exception coming from the "wrong place". > > This patch makes sure the vcpu->arch.exception.injected and > vcpu->arch.exception.pending are in sync with the reality (and > with VMCS). Also it adds clearing of pending exception to > __set_sregs() the same way it is in __set_regs(). See patch > b4f14abd9 that added it to __set_regs(). > > How to trigger the buggy scenario (that is, without this patch): > - Make sure you have the old CPU where shadow page tables > are used. Core2 family should be fine for the task. In this > case, all PF exceptions produce the exit to monitor. > - You need the _TIF_SIGPENDING flag set at the right moment > to get kvm_vcpu_exit_request() to return true when the PF > exception was just injected. In that case the cancel_injection > path is executed. > - You need the "unlucky" user-space that executes KVM_SET_REGS > at the right moment. This leads to KVM_SET_REGS not clearing > the exception, but instead corrupting its context. > > v2 changes: > - do not add WARN_ON_ONCE() to __set_regs(). As explained by > Vitaly Kuznetsov, it can be user-triggerable. > - clear pending exception also in __set_sregs(). > - update description with the bug-triggering scenario. I used to know that area very very well, and I basically combed the whole thing back and forth, and I have patch series to decouple injected and pending exceptions. I'll refresh my memory on this and then I'll review your patch. My gut feeling is that you discovered too that injections of exceptions from userspace is kind of broken and only works because Qemu doesn't really inject much (only some #MC exceptions which are mostly fatal anyway, and #DB/#BP which only happen when guest debugging is enabled which is not a standard feature). Best regards, Maxim Levitsky > > Signed-off-by: Stas Sergeev <stsp2@yandex.ru> > > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Sean Christopherson <seanjc@google.com> > CC: Vitaly Kuznetsov <vkuznets@redhat.com> > CC: Jim Mattson <jmattson@google.com> > CC: Joerg Roedel <joro@8bytes.org> > CC: Thomas Gleixner <tglx@linutronix.de> > CC: Ingo Molnar <mingo@redhat.com> > CC: Borislav Petkov <bp@alien8.de> > CC: Jan Kiszka <jan.kiszka@siemens.com> > CC: x86@kernel.org > CC: "H. Peter Anvin" <hpa@zytor.com> > CC: kvm@vger.kernel.org > --- > arch/x86/kvm/x86.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index e0f4a46649d7..d1026e9216e4 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -9450,7 +9450,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > cancel_injection: > if (req_immediate_exit) > kvm_make_request(KVM_REQ_EVENT, vcpu); > - static_call(kvm_x86_cancel_injection)(vcpu); > + if (vcpu->arch.exception.injected) { > + static_call(kvm_x86_cancel_injection)(vcpu); > + vcpu->arch.exception.injected = false; > + vcpu->arch.exception.pending = true; > + } > if (unlikely(vcpu->arch.apic_attention)) > kvm_lapic_sync_from_vapic(vcpu); > out: > @@ -10077,6 +10081,8 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) > pr_debug("Set back pending irq %d\n", pending_vec); > } > > + vcpu->arch.exception.pending = false; > + > kvm_make_request(KVM_REQ_EVENT, vcpu); > > ret = 0;
28.06.2021 17:29, Maxim Levitsky пишет: > I used to know that area very very well, and I basically combed > the whole thing back and forth, > and I have patch series to decouple injected and > pending exceptions. Yes, and also I dislike the fact that you can't easily distinguish the exception injected from user-space, with the PF coming from the guest itself. They occupy the same pending/injected fields. Some refactoring will definitely not hurt. > I'll refresh my memory on this and then I'll review your patch. > > My gut feeling is that you discovered too that injections of > exceptions from userspace is kind of broken and only works > because Qemu doesn't really inject much Actually I discovered that injecting _interrupts_ is kinda broken (on Core2), because they clash with guest's PF. Maybe if I would be using KVM-supplied injection APIs, I would avoid the problem. But I just use KVM_SET_REGS to inject the interrupt, which perhaps qemu doesn't do.
On Mon, Jun 28, 2021 at 8:09 AM stsp <stsp2@yandex.ru> wrote: > > 28.06.2021 17:29, Maxim Levitsky пишет: > > I used to know that area very very well, and I basically combed > > the whole thing back and forth, > > and I have patch series to decouple injected and > > pending exceptions. > > Yes, and also I dislike the fact > that you can't easily distinguish > the exception injected from > user-space, with the PF coming > from the guest itself. They occupy > the same pending/injected fields. > Some refactoring will definitely > not hurt. > > > > I'll refresh my memory on this and then I'll review your patch. > > > > My gut feeling is that you discovered too that injections of > > exceptions from userspace is kind of broken and only works > > because Qemu doesn't really inject much > > Actually I discovered that injecting > _interrupts_ is kinda broken (on Core2), > because they clash with guest's PF. > Maybe if I would be using KVM-supplied > injection APIs, I would avoid the problem. > But I just use KVM_SET_REGS to inject > the interrupt, which perhaps qemu doesn't > do. > I don't know how you can inject an interrupt with KVM_SET_REGS, but I suspect that you're doing something wrong. :-) If I wanted to inject an interrupt from userspace, I would use KVM_SET_LAPIC (assuming that the local APIC is active) to set the appropriate bit in IRRV. Before you can deliver an interrupt, you have to check the local APIC anyway, to see whether or not your interrupt would be blocked by PPR.
28.06.2021 19:50, Jim Mattson пишет: > I don't know how you can inject an interrupt with KVM_SET_REGS, but I > suspect that you're doing something wrong. :-) > > If I wanted to inject an interrupt from userspace, I would use > KVM_SET_LAPIC (assuming that the local APIC is active) to set the > appropriate bit in IRRV. Before you can deliver an interrupt, you have > to check the local APIC anyway, to see whether or not your interrupt > would be blocked by PPR. I do not use any of the emulated HW in KVM. PIC is in user-space, and no apic. This is a very simple hypervisor for running the ring3 code only. So to inject an interrupt, I create a stack frame and set eflags/cs/eip to the needed values, and that's all. Just as people did in a pre-KVM era. :)
On Mon, 2021-06-28 at 17:29 +0300, Maxim Levitsky wrote: > On Mon, 2021-06-28 at 15:48 +0300, Stas Sergeev wrote: > > When returning to user, the special care is taken about the > > exception that was already injected to VMCS but not yet to guest. > > cancel_injection removes such exception from VMCS. It is set as > > pending, and if the user does KVM_SET_REGS, it gets completely canceled. > > > > This didn't happen though, because the vcpu->arch.exception.injected > > and vcpu->arch.exception.pending were forgotten to update in > > cancel_injection. As the result, KVM_SET_REGS didn't cancel out > > anything, and the exception was re-injected on the next KVM_RUN, > > even though the guest registers (like EIP) were already modified. > > This was leading to an exception coming from the "wrong place". > > > > This patch makes sure the vcpu->arch.exception.injected and > > vcpu->arch.exception.pending are in sync with the reality (and > > with VMCS). Also it adds clearing of pending exception to > > __set_sregs() the same way it is in __set_regs(). See patch > > b4f14abd9 that added it to __set_regs(). > > > > How to trigger the buggy scenario (that is, without this patch): > > - Make sure you have the old CPU where shadow page tables > > are used. Core2 family should be fine for the task. In this > > case, all PF exceptions produce the exit to monitor. > > - You need the _TIF_SIGPENDING flag set at the right moment > > to get kvm_vcpu_exit_request() to return true when the PF > > exception was just injected. In that case the cancel_injection > > path is executed. > > - You need the "unlucky" user-space that executes KVM_SET_REGS > > at the right moment. This leads to KVM_SET_REGS not clearing > > the exception, but instead corrupting its context. > > > > v2 changes: > > - do not add WARN_ON_ONCE() to __set_regs(). As explained by > > Vitaly Kuznetsov, it can be user-triggerable. > > - clear pending exception also in __set_sregs(). > > - update description with the bug-triggering scenario. > > I used to know that area very very well, and I basically combed > the whole thing back and forth, > and I have patch series to decouple injected and > pending exceptions. > > I'll refresh my memory on this and then I'll review your patch. Hi! Sorry for the delayed response. First of all indeed an exception can be either in pending or injected state. A pending exception is an exception KVM created but that didn't yet change the guest state. It still has to be injected to the guest on its current instruction, thus its not like a pending interrupt. An injected exception is an exception which was already injected to the guest (or at least attempted an injection but that was aborted) Since the exception payload is injected prior to pushing the error code on the stack and then jumping to the exception handler, the guest state is already modified. There can be 2 reasons why exception delivery is aborted like that: 1. By KVM (that is cancel_injection), when KVM needs to exit to userspace due to a signal 2. By the CPU, when CPU detects another exception or paging fault, while delivering this exception. In both cases the exception stays in the 'injected' state, and has to be injected again (that is why I don't really like the 'injected' term, since it is more like 'injection_started'). Now about the KVM's userspace API where this is exposed: I see now too that KVM_SET_REGS clears the pending exception. This is new to me and it is IMHO *wrong* thing to do. However I bet that someone somewhere depends on this, since this behavior is very old. This doesn't affect qemu since when it does KVM_SET_REGS, it also does KVM_SET_VCPU_EVENTS afterward, and that restores either pending or injected exception state. (that state is first read with KVM_GET_VCPU_EVENTS ioctl). Also note just for reference that KVM_SET_SREGS has ability to set a pending interrupt, something that is also redundant since KVM_SET_VCPU_EVENTS does this. I recently added KVM_SET_SREGS2 ioctl which now lacks this ability. KVM_SET_VCPU_EVENTS/KVM_GET_VCPU_EVENTS allows to read/write state of both pending and injected exception and on top of that allows to read/write the exception payload of a pending exception. You should consider using it to preserve/modify the exception state, although the later isn't recommended (qemu does this in few places, but it is a bit buggy, especially in regard to nested guests). Best regards, Maxim Levitsky
06.07.2021 14:49, Maxim Levitsky пишет: > Now about the KVM's userspace API where this is exposed: > > I see now too that KVM_SET_REGS clears the pending exception. > This is new to me and it is IMHO *wrong* thing to do. > However I bet that someone somewhere depends on this, > since this behavior is very old. What alternative would you suggest? Check for ready_for_interrupt_injection and never call KVM_SET_REGS if it indicates "not ready"? But what if someone calls it nevertheless? Perhaps return an error from KVM_SET_REGS if exception is pending? Also KVM_SET_SREGS needs some treatment here too, as it can also be called when an exception is pending, leading to problems. > This doesn't affect qemu since when it does KVM_SET_REGS, > it also does KVM_SET_VCPU_EVENTS afterward, and that > restores either pending or injected exception state. > (that state is first read with KVM_GET_VCPU_EVENTS ioctl). > > Also note just for reference that KVM_SET_SREGS has ability > to set a pending interrupt, something that is also redundant > since KVM_SET_VCPU_EVENTS does this. > I recently added KVM_SET_SREGS2 ioctl which now lacks this > ability. > > KVM_SET_VCPU_EVENTS/KVM_GET_VCPU_EVENTS allows to read/write > state of both pending and injected exception and on top of that > allows to read/write the exception payload of a pending exception. > > You should consider using it to preserve/modify the exception state, > although the later isn't recommended (qemu does this in few places, > but it is a bit buggy, especially in regard to nested guests). I need neither to preserve nor modify the exception state. All I need is to safely call KVM_SET_REGS/KVM_SET_SREGS, but that appears unsafe when exception is pending. Please take a look into this commit: https://www.lkml.org/lkml/2020/12/1/324 It was suggested that the removal of "!kvm_event_needs_reinjection(vcpu)" condition from kvm_vcpu_ready_for_interrupt_injection() is what introduced the problem, as right now ready_for_interrupt_injection doesn't account for pending exception. I already added the check to my user-space code, and now it works reliably on some very old kernels prior to the aforementioned patch. So should we treat that as a regressions, or any other proposal?
On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: > 06.07.2021 14:49, Maxim Levitsky пишет: > > Now about the KVM's userspace API where this is exposed: > > > > I see now too that KVM_SET_REGS clears the pending exception. > > This is new to me and it is IMHO *wrong* thing to do. > > However I bet that someone somewhere depends on this, > > since this behavior is very old. > > What alternative would you suggest? > Check for ready_for_interrupt_injection > and never call KVM_SET_REGS if it indicates > "not ready"? > But what if someone calls it nevertheless? > Perhaps return an error from KVM_SET_REGS > if exception is pending? Also KVM_SET_SREGS > needs some treatment here too, as it can > also be called when an exception is pending, > leading to problems. As I explained you can call KVM_GET_VCPU_EVENTS before calling KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct that was filled by KVM_GET_VCPU_EVENTS. That will preserve all the cpu events. > > > > This doesn't affect qemu since when it does KVM_SET_REGS, > > it also does KVM_SET_VCPU_EVENTS afterward, and that > > restores either pending or injected exception state. > > (that state is first read with KVM_GET_VCPU_EVENTS ioctl). > > > > Also note just for reference that KVM_SET_SREGS has ability > > to set a pending interrupt, something that is also redundant > > since KVM_SET_VCPU_EVENTS does this. > > I recently added KVM_SET_SREGS2 ioctl which now lacks this > > ability. > > > > KVM_SET_VCPU_EVENTS/KVM_GET_VCPU_EVENTS allows to read/write > > state of both pending and injected exception and on top of that > > allows to read/write the exception payload of a pending exception. > > > > You should consider using it to preserve/modify the exception state, > > although the later isn't recommended (qemu does this in few places, > > but it is a bit buggy, especially in regard to nested guests). > I need neither to preserve nor modify > the exception state. All I need is to safely > call KVM_SET_REGS/KVM_SET_SREGS, but > that appears unsafe when exception is > pending. > > Please take a look into this commit: > https://www.lkml.org/lkml/2020/12/1/324 > > It was suggested that the removal > of "!kvm_event_needs_reinjection(vcpu)" > condition from kvm_vcpu_ready_for_interrupt_injection() > is what introduced the problem, as > right now ready_for_interrupt_injection > doesn't account for pending exception. > I already added the check to my > user-space code, and now it works > reliably on some very old kernels > prior to the aforementioned patch. > So should we treat that as a regressions, > or any other proposal? I haven't studied the userspace interrupt injection code much but it does sound like if we signal to userspace that irq window is open, that means that indeed there must be no injected interrupts/exceptions. Now let me explain how nesting of events supposed to work on real hardware: Event nesting happens when during delivery of an event (interrupt or exception) we got another event (it has to be exception practically, and it happens only when the delivery process (like switching stacks, reading IDT, pushing stuff to the stack, etc causes an exception) There are 4 combinations of an event, and event that happened during delivery of it. 1. Exception->Exception: There is an exception, and during delivery of it we got an exception (like #PF on accessing GDT, or #NP, #SS, or something like that) In this case exceptions are merged which means that they are either converted to #DF, first exception is lost, or we get a triple fault (SDM has a table for all cases). 2. Interrupt->Exception If there is an injected interrupt and during delivery of it, we get an exception this means that jump to the interrupt handler caused an exception, in which case the interrupt is lost and we run the exception handler instead. 3. Interrupt->Interrupt 4. Exception->Interrupt. Those two cases can't happen, as interrupts should not be processed while delivering an exception/interrupt. Therefore indeed if we signal the userspace that interrupt window is open, that means that there must be no injected interrupt/exception. If the userspace however wants to inject an exception, currently it can only correctly do so if it uses KVM_GET_VCPU_EVENTS, to see if we already have a injected exception there and then merge itself new exception. And that still doesn't work correctly when nested guests are involved. As I said the userspace exception injection via KVM_GET_VCPU_EVENTS is kind of broken, and only works in qemu since it uses it very rarely. Paolo, what do you think? Best regards, Maxim Levitsky
06.07.2021 23:29, Maxim Levitsky пишет: > On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: >> 06.07.2021 14:49, Maxim Levitsky пишет: >>> Now about the KVM's userspace API where this is exposed: >>> >>> I see now too that KVM_SET_REGS clears the pending exception. >>> This is new to me and it is IMHO *wrong* thing to do. >>> However I bet that someone somewhere depends on this, >>> since this behavior is very old. >> What alternative would you suggest? >> Check for ready_for_interrupt_injection >> and never call KVM_SET_REGS if it indicates >> "not ready"? >> But what if someone calls it nevertheless? >> Perhaps return an error from KVM_SET_REGS >> if exception is pending? Also KVM_SET_SREGS >> needs some treatment here too, as it can >> also be called when an exception is pending, >> leading to problems. > As I explained you can call KVM_GET_VCPU_EVENTS before calling > KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct > that was filled by KVM_GET_VCPU_EVENTS. > That will preserve all the cpu events. The question is different. I wonder how _should_ the KVM API behave when someone calls KVM_SET_REGS/KVM_SET_SREGS while the exception is pending. This is currently not handled properly. We can add/fix the indication with ready_for_interrupt_injection, but someone will ignore that indication, so some handling (like returning an error) should be added. So what would you propose the KVM_SET_REGS should do if it is called when an exception is pending? The question is here because currently KVM_SET_REGS and KVM_SET_SREGS handle that differently: one is trying to cancel the pending excpetion, and the other one does nothing, but both are wrong.
On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: > 06.07.2021 23:29, Maxim Levitsky пишет: > > On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: > > > 06.07.2021 14:49, Maxim Levitsky пишет: > > > > Now about the KVM's userspace API where this is exposed: > > > > > > > > I see now too that KVM_SET_REGS clears the pending exception. > > > > This is new to me and it is IMHO *wrong* thing to do. > > > > However I bet that someone somewhere depends on this, > > > > since this behavior is very old. > > > What alternative would you suggest? > > > Check for ready_for_interrupt_injection > > > and never call KVM_SET_REGS if it indicates > > > "not ready"? > > > But what if someone calls it nevertheless? > > > Perhaps return an error from KVM_SET_REGS > > > if exception is pending? Also KVM_SET_SREGS > > > needs some treatment here too, as it can > > > also be called when an exception is pending, > > > leading to problems. > > As I explained you can call KVM_GET_VCPU_EVENTS before calling > > KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct > > that was filled by KVM_GET_VCPU_EVENTS. > > That will preserve all the cpu events. > > The question is different. > I wonder how _should_ the KVM > API behave when someone calls > KVM_SET_REGS/KVM_SET_SREGS KVM_SET_REGS should not clear the pending exception. but fixing this can break API compatibilitly if some hypervisor (not qemu) relies on it. Thus either a new ioctl is needed or as I said, KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used to preserve the events around that call as workaround. Best regards, Maxim Levitsky > while the exception is pending. > This is currently not handled properly. > We can add/fix the indication with > ready_for_interrupt_injection, > but someone will ignore that > indication, so some handling > (like returning an error) should > be added. > So what would you propose the > KVM_SET_REGS should do if it is > called when an exception is pending? > The question is here because > currently KVM_SET_REGS and > KVM_SET_SREGS handle that differently: > one is trying to cancel the pending > excpetion, and the other one > does nothing, but both are wrong. >
07.07.2021 02:00, Maxim Levitsky пишет: > On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: >> 06.07.2021 23:29, Maxim Levitsky пишет: >>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: >>>> 06.07.2021 14:49, Maxim Levitsky пишет: >>>>> Now about the KVM's userspace API where this is exposed: >>>>> >>>>> I see now too that KVM_SET_REGS clears the pending exception. >>>>> This is new to me and it is IMHO *wrong* thing to do. >>>>> However I bet that someone somewhere depends on this, >>>>> since this behavior is very old. >>>> What alternative would you suggest? >>>> Check for ready_for_interrupt_injection >>>> and never call KVM_SET_REGS if it indicates >>>> "not ready"? >>>> But what if someone calls it nevertheless? >>>> Perhaps return an error from KVM_SET_REGS >>>> if exception is pending? Also KVM_SET_SREGS >>>> needs some treatment here too, as it can >>>> also be called when an exception is pending, >>>> leading to problems. >>> As I explained you can call KVM_GET_VCPU_EVENTS before calling >>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct >>> that was filled by KVM_GET_VCPU_EVENTS. >>> That will preserve all the cpu events. >> The question is different. >> I wonder how _should_ the KVM >> API behave when someone calls >> KVM_SET_REGS/KVM_SET_SREGS > KVM_SET_REGS should not clear the pending exception. > but fixing this can break API compatibilitly if some > hypervisor (not qemu) relies on it. > > Thus either a new ioctl is needed or as I said, > KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used > to preserve the events around that call as workaround. But I don't need to preserve events. Canceling is perfectly fine with me because, if I inject the interrupt at that point, the exception will be re-triggered anyway after interrupt handler returns. What I ask is how SHOULD the KVM_SET_REGS and KVM_SET_SREGS behave when someone (mistakenly) calls them with the exception pending. Should they return an error instead of canceling exception?
On 07/07/21 01:06, stsp wrote: > What I ask is how SHOULD the > KVM_SET_REGS and KVM_SET_SREGS > behave when someone (mistakenly) > calls them with the exception pending. > Should they return an error > instead of canceling exception? In theory, KVM_SET_REGS and KVM_SET_SREGS should do nothing but set the value of the registers. They not should clear either vcpu->arch.exception.pending or vcpu->arch.exception.injected. I'm wary of changing that and breaking users of KVM, though. In this case the problem is that, with a pending exception, you should not inject the interrupt (doesn't matter if it's with KVM_SET_REGS or KVM_INTERRUPT). Raising a page fault is part of executing the previous instruction, and interrupts are only recognized at instruction boundaries. Therefore, you need to test ready_for_interrupt_injection, and possibly use request_interrupt_window, before calling KVM_SET_REGS. The patch you identified as the culprit does have a bug, but that's fixed in kvm_cpu_accept_dm_intr as I suggested in the other thread. Paolo
07.07.2021 02:45, Paolo Bonzini пишет: > On 07/07/21 01:06, stsp wrote: >> What I ask is how SHOULD the >> KVM_SET_REGS and KVM_SET_SREGS >> behave when someone (mistakenly) >> calls them with the exception pending. >> Should they return an error >> instead of canceling exception? > > In theory, KVM_SET_REGS and KVM_SET_SREGS should do nothing but set > the value of the registers. They not should clear either > vcpu->arch.exception.pending or vcpu->arch.exception.injected. Maybe they should return an error, or do something else to alert the user? Otherwise the 100% wrong usage gets unnoticed.
On 07/07/21 01:51, stsp wrote: > 07.07.2021 02:45, Paolo Bonzini пишет: >> On 07/07/21 01:06, stsp wrote: >>> What I ask is how SHOULD the >>> KVM_SET_REGS and KVM_SET_SREGS >>> behave when someone (mistakenly) >>> calls them with the exception pending. >>> Should they return an error >>> instead of canceling exception? >> >> In theory, KVM_SET_REGS and KVM_SET_SREGS should do nothing but set >> the value of the registers. They not should clear either >> vcpu->arch.exception.pending or vcpu->arch.exception.injected. > Maybe they should return an > error, or do something else to > alert the user? Otherwise the > 100% wrong usage gets unnoticed. No, setting the registers means that: setting the registers. If you set RSP, the new value will be used for delivering the exception. Paolo
On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote: > > 07.07.2021 02:00, Maxim Levitsky пишет: > > On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: > >> 06.07.2021 23:29, Maxim Levitsky пишет: > >>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: > >>>> 06.07.2021 14:49, Maxim Levitsky пишет: > >>>>> Now about the KVM's userspace API where this is exposed: > >>>>> > >>>>> I see now too that KVM_SET_REGS clears the pending exception. > >>>>> This is new to me and it is IMHO *wrong* thing to do. > >>>>> However I bet that someone somewhere depends on this, > >>>>> since this behavior is very old. > >>>> What alternative would you suggest? > >>>> Check for ready_for_interrupt_injection > >>>> and never call KVM_SET_REGS if it indicates > >>>> "not ready"? > >>>> But what if someone calls it nevertheless? > >>>> Perhaps return an error from KVM_SET_REGS > >>>> if exception is pending? Also KVM_SET_SREGS > >>>> needs some treatment here too, as it can > >>>> also be called when an exception is pending, > >>>> leading to problems. > >>> As I explained you can call KVM_GET_VCPU_EVENTS before calling > >>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct > >>> that was filled by KVM_GET_VCPU_EVENTS. > >>> That will preserve all the cpu events. > >> The question is different. > >> I wonder how _should_ the KVM > >> API behave when someone calls > >> KVM_SET_REGS/KVM_SET_SREGS > > KVM_SET_REGS should not clear the pending exception. > > but fixing this can break API compatibilitly if some > > hypervisor (not qemu) relies on it. > > > > Thus either a new ioctl is needed or as I said, > > KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used > > to preserve the events around that call as workaround. > But I don't need to preserve > events. Canceling is perfectly > fine with me because, if I inject > the interrupt at that point, the > exception will be re-triggered > anyway after interrupt handler > returns. The exception will not be re-triggered if it was a trap, because the instruction that caused the exception will already have retired, and RIP will have advanced. Moreover, if the exception had any side-effects (#PF and sometimes #DB), those side-effects have already happened, so if IDT vectoring is cancelled, the vCPU is left in an architecturally unreachable state. > What I ask is how SHOULD the > KVM_SET_REGS and KVM_SET_SREGS > behave when someone (mistakenly) > calls them with the exception pending. > Should they return an error > instead of canceling exception?
07.07.2021 19:16, Jim Mattson пишет: > On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote: >> 07.07.2021 02:00, Maxim Levitsky пишет: >>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: >>>> 06.07.2021 23:29, Maxim Levitsky пишет: >>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: >>>>>> 06.07.2021 14:49, Maxim Levitsky пишет: >>>>>>> Now about the KVM's userspace API where this is exposed: >>>>>>> >>>>>>> I see now too that KVM_SET_REGS clears the pending exception. >>>>>>> This is new to me and it is IMHO *wrong* thing to do. >>>>>>> However I bet that someone somewhere depends on this, >>>>>>> since this behavior is very old. >>>>>> What alternative would you suggest? >>>>>> Check for ready_for_interrupt_injection >>>>>> and never call KVM_SET_REGS if it indicates >>>>>> "not ready"? >>>>>> But what if someone calls it nevertheless? >>>>>> Perhaps return an error from KVM_SET_REGS >>>>>> if exception is pending? Also KVM_SET_SREGS >>>>>> needs some treatment here too, as it can >>>>>> also be called when an exception is pending, >>>>>> leading to problems. >>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling >>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct >>>>> that was filled by KVM_GET_VCPU_EVENTS. >>>>> That will preserve all the cpu events. >>>> The question is different. >>>> I wonder how _should_ the KVM >>>> API behave when someone calls >>>> KVM_SET_REGS/KVM_SET_SREGS >>> KVM_SET_REGS should not clear the pending exception. >>> but fixing this can break API compatibilitly if some >>> hypervisor (not qemu) relies on it. >>> >>> Thus either a new ioctl is needed or as I said, >>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used >>> to preserve the events around that call as workaround. >> But I don't need to preserve >> events. Canceling is perfectly >> fine with me because, if I inject >> the interrupt at that point, the >> exception will be re-triggered >> anyway after interrupt handler >> returns. > The exception will not be re-triggered if it was a trap, But my assumption was that everything is atomic, except PF with shadow page tables. I guess you mean the cases when the exception delivery causes EPT fault, which is a bit of a corner case.
On Wed, Jul 7, 2021 at 9:34 AM stsp <stsp2@yandex.ru> wrote: > > 07.07.2021 19:16, Jim Mattson пишет: > > On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote: > >> 07.07.2021 02:00, Maxim Levitsky пишет: > >>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: > >>>> 06.07.2021 23:29, Maxim Levitsky пишет: > >>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: > >>>>>> 06.07.2021 14:49, Maxim Levitsky пишет: > >>>>>>> Now about the KVM's userspace API where this is exposed: > >>>>>>> > >>>>>>> I see now too that KVM_SET_REGS clears the pending exception. > >>>>>>> This is new to me and it is IMHO *wrong* thing to do. > >>>>>>> However I bet that someone somewhere depends on this, > >>>>>>> since this behavior is very old. > >>>>>> What alternative would you suggest? > >>>>>> Check for ready_for_interrupt_injection > >>>>>> and never call KVM_SET_REGS if it indicates > >>>>>> "not ready"? > >>>>>> But what if someone calls it nevertheless? > >>>>>> Perhaps return an error from KVM_SET_REGS > >>>>>> if exception is pending? Also KVM_SET_SREGS > >>>>>> needs some treatment here too, as it can > >>>>>> also be called when an exception is pending, > >>>>>> leading to problems. > >>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling > >>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct > >>>>> that was filled by KVM_GET_VCPU_EVENTS. > >>>>> That will preserve all the cpu events. > >>>> The question is different. > >>>> I wonder how _should_ the KVM > >>>> API behave when someone calls > >>>> KVM_SET_REGS/KVM_SET_SREGS > >>> KVM_SET_REGS should not clear the pending exception. > >>> but fixing this can break API compatibilitly if some > >>> hypervisor (not qemu) relies on it. > >>> > >>> Thus either a new ioctl is needed or as I said, > >>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used > >>> to preserve the events around that call as workaround. > >> But I don't need to preserve > >> events. Canceling is perfectly > >> fine with me because, if I inject > >> the interrupt at that point, the > >> exception will be re-triggered > >> anyway after interrupt handler > >> returns. > > The exception will not be re-triggered if it was a trap, > But my assumption was that > everything is atomic, except > PF with shadow page tables. > I guess you mean the cases > when the exception delivery > causes EPT fault, which is a bit > of a corner case. No, that's not what I mean. Consider the #DB exception, which is intercepted in all configurations to circumvent a DoS attack. Some #DB exceptions modify DR6. Once the exception has been 'injected,' DR6 has already been modified. If you do not complete the injection, but you deliver an interrupt instead, then the interrupt handler can see a DR6 value that is architecturally impossible.
07.07.2021 19:46, Jim Mattson пишет: > On Wed, Jul 7, 2021 at 9:34 AM stsp <stsp2@yandex.ru> wrote: >> 07.07.2021 19:16, Jim Mattson пишет: >>> On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote: >>>> 07.07.2021 02:00, Maxim Levitsky пишет: >>>>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: >>>>>> 06.07.2021 23:29, Maxim Levitsky пишет: >>>>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: >>>>>>>> 06.07.2021 14:49, Maxim Levitsky пишет: >>>>>>>>> Now about the KVM's userspace API where this is exposed: >>>>>>>>> >>>>>>>>> I see now too that KVM_SET_REGS clears the pending exception. >>>>>>>>> This is new to me and it is IMHO *wrong* thing to do. >>>>>>>>> However I bet that someone somewhere depends on this, >>>>>>>>> since this behavior is very old. >>>>>>>> What alternative would you suggest? >>>>>>>> Check for ready_for_interrupt_injection >>>>>>>> and never call KVM_SET_REGS if it indicates >>>>>>>> "not ready"? >>>>>>>> But what if someone calls it nevertheless? >>>>>>>> Perhaps return an error from KVM_SET_REGS >>>>>>>> if exception is pending? Also KVM_SET_SREGS >>>>>>>> needs some treatment here too, as it can >>>>>>>> also be called when an exception is pending, >>>>>>>> leading to problems. >>>>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling >>>>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct >>>>>>> that was filled by KVM_GET_VCPU_EVENTS. >>>>>>> That will preserve all the cpu events. >>>>>> The question is different. >>>>>> I wonder how _should_ the KVM >>>>>> API behave when someone calls >>>>>> KVM_SET_REGS/KVM_SET_SREGS >>>>> KVM_SET_REGS should not clear the pending exception. >>>>> but fixing this can break API compatibilitly if some >>>>> hypervisor (not qemu) relies on it. >>>>> >>>>> Thus either a new ioctl is needed or as I said, >>>>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used >>>>> to preserve the events around that call as workaround. >>>> But I don't need to preserve >>>> events. Canceling is perfectly >>>> fine with me because, if I inject >>>> the interrupt at that point, the >>>> exception will be re-triggered >>>> anyway after interrupt handler >>>> returns. >>> The exception will not be re-triggered if it was a trap, >> But my assumption was that >> everything is atomic, except >> PF with shadow page tables. >> I guess you mean the cases >> when the exception delivery >> causes EPT fault, which is a bit >> of a corner case. > No, that's not what I mean. Consider the #DB exception, which is > intercepted in all configurations to circumvent a DoS attack. Some #DB > exceptions modify DR6. Once the exception has been 'injected,' DR6 has > already been modified. If you do not complete the injection, but you > deliver an interrupt instead, then the interrupt handler can see a DR6 > value that is architecturally impossible. Yes, I understand that part. It seems to be called the "exception payload" in kvm sources, and includes also CR2 for #PF. So of course if there are many non-atomic cases, rather than just one, then there are no doubts we need to check ready_for_injection. Its just that I was looking at that non-atomicity as a kvm's quirk, but its probably the fundamental part of vmx instead. There is still the problem that KVM_SET_REGS cancels the injection and KVM_SET_SREGS not. But I realize you may want to leave it that way for compatibility.
On Wed, Jul 7, 2021 at 9:58 AM stsp <stsp2@yandex.ru> wrote: > > 07.07.2021 19:46, Jim Mattson пишет: > > On Wed, Jul 7, 2021 at 9:34 AM stsp <stsp2@yandex.ru> wrote: > >> 07.07.2021 19:16, Jim Mattson пишет: > >>> On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote: > >>>> 07.07.2021 02:00, Maxim Levitsky пишет: > >>>>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote: > >>>>>> 06.07.2021 23:29, Maxim Levitsky пишет: > >>>>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote: > >>>>>>>> 06.07.2021 14:49, Maxim Levitsky пишет: > >>>>>>>>> Now about the KVM's userspace API where this is exposed: > >>>>>>>>> > >>>>>>>>> I see now too that KVM_SET_REGS clears the pending exception. > >>>>>>>>> This is new to me and it is IMHO *wrong* thing to do. > >>>>>>>>> However I bet that someone somewhere depends on this, > >>>>>>>>> since this behavior is very old. > >>>>>>>> What alternative would you suggest? > >>>>>>>> Check for ready_for_interrupt_injection > >>>>>>>> and never call KVM_SET_REGS if it indicates > >>>>>>>> "not ready"? > >>>>>>>> But what if someone calls it nevertheless? > >>>>>>>> Perhaps return an error from KVM_SET_REGS > >>>>>>>> if exception is pending? Also KVM_SET_SREGS > >>>>>>>> needs some treatment here too, as it can > >>>>>>>> also be called when an exception is pending, > >>>>>>>> leading to problems. > >>>>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling > >>>>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct > >>>>>>> that was filled by KVM_GET_VCPU_EVENTS. > >>>>>>> That will preserve all the cpu events. > >>>>>> The question is different. > >>>>>> I wonder how _should_ the KVM > >>>>>> API behave when someone calls > >>>>>> KVM_SET_REGS/KVM_SET_SREGS > >>>>> KVM_SET_REGS should not clear the pending exception. > >>>>> but fixing this can break API compatibilitly if some > >>>>> hypervisor (not qemu) relies on it. > >>>>> > >>>>> Thus either a new ioctl is needed or as I said, > >>>>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used > >>>>> to preserve the events around that call as workaround. > >>>> But I don't need to preserve > >>>> events. Canceling is perfectly > >>>> fine with me because, if I inject > >>>> the interrupt at that point, the > >>>> exception will be re-triggered > >>>> anyway after interrupt handler > >>>> returns. > >>> The exception will not be re-triggered if it was a trap, > >> But my assumption was that > >> everything is atomic, except > >> PF with shadow page tables. > >> I guess you mean the cases > >> when the exception delivery > >> causes EPT fault, which is a bit > >> of a corner case. > > No, that's not what I mean. Consider the #DB exception, which is > > intercepted in all configurations to circumvent a DoS attack. Some #DB > > exceptions modify DR6. Once the exception has been 'injected,' DR6 has > > already been modified. If you do not complete the injection, but you > > deliver an interrupt instead, then the interrupt handler can see a DR6 > > value that is architecturally impossible. > > Yes, I understand that part. > It seems to be called the "exception > payload" in kvm sources, and > includes also CR2 for #PF. > So of course if there are many > non-atomic cases, rather than > just one, then there are no doubts > we need to check ready_for_injection. > Its just that I was looking at that > non-atomicity as a kvm's quirk, > but its probably the fundamental > part of vmx instead. You could argue that it's a Linux quirk. Without EINTR, this probably wouldn't be an issue. > There is still the problem that > KVM_SET_REGS cancels the > injection and KVM_SET_SREGS not. > But I realize you may want to leave > it that way for compatibility. Sadly, broken userspace APIs often have to remain broken. The "fix" is probably to introduce KVM_SET_REGS2.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e0f4a46649d7..d1026e9216e4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9450,7 +9450,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) cancel_injection: if (req_immediate_exit) kvm_make_request(KVM_REQ_EVENT, vcpu); - static_call(kvm_x86_cancel_injection)(vcpu); + if (vcpu->arch.exception.injected) { + static_call(kvm_x86_cancel_injection)(vcpu); + vcpu->arch.exception.injected = false; + vcpu->arch.exception.pending = true; + } if (unlikely(vcpu->arch.apic_attention)) kvm_lapic_sync_from_vapic(vcpu); out: @@ -10077,6 +10081,8 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) pr_debug("Set back pending irq %d\n", pending_vec); } + vcpu->arch.exception.pending = false; + kvm_make_request(KVM_REQ_EVENT, vcpu); ret = 0;
When returning to user, the special care is taken about the exception that was already injected to VMCS but not yet to guest. cancel_injection removes such exception from VMCS. It is set as pending, and if the user does KVM_SET_REGS, it gets completely canceled. This didn't happen though, because the vcpu->arch.exception.injected and vcpu->arch.exception.pending were forgotten to update in cancel_injection. As the result, KVM_SET_REGS didn't cancel out anything, and the exception was re-injected on the next KVM_RUN, even though the guest registers (like EIP) were already modified. This was leading to an exception coming from the "wrong place". This patch makes sure the vcpu->arch.exception.injected and vcpu->arch.exception.pending are in sync with the reality (and with VMCS). Also it adds clearing of pending exception to __set_sregs() the same way it is in __set_regs(). See patch b4f14abd9 that added it to __set_regs(). How to trigger the buggy scenario (that is, without this patch): - Make sure you have the old CPU where shadow page tables are used. Core2 family should be fine for the task. In this case, all PF exceptions produce the exit to monitor. - You need the _TIF_SIGPENDING flag set at the right moment to get kvm_vcpu_exit_request() to return true when the PF exception was just injected. In that case the cancel_injection path is executed. - You need the "unlucky" user-space that executes KVM_SET_REGS at the right moment. This leads to KVM_SET_REGS not clearing the exception, but instead corrupting its context. v2 changes: - do not add WARN_ON_ONCE() to __set_regs(). As explained by Vitaly Kuznetsov, it can be user-triggerable. - clear pending exception also in __set_sregs(). - update description with the bug-triggering scenario. Signed-off-by: Stas Sergeev <stsp2@yandex.ru> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Sean Christopherson <seanjc@google.com> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Jim Mattson <jmattson@google.com> CC: Joerg Roedel <joro@8bytes.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: Borislav Petkov <bp@alien8.de> CC: Jan Kiszka <jan.kiszka@siemens.com> CC: x86@kernel.org CC: "H. Peter Anvin" <hpa@zytor.com> CC: kvm@vger.kernel.org --- arch/x86/kvm/x86.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)