diff mbox series

[v2] KVM: X86: Fix exception untrigger on ret to user

Message ID 20210628124814.1001507-1-stsp2@yandex.ru (mailing list archive)
State New, archived
Headers show
Series [v2] KVM: X86: Fix exception untrigger on ret to user | expand

Commit Message

Stas Sergeev June 28, 2021, 12:48 p.m. UTC
When returning to user, the special care is taken about the
exception that was already injected to VMCS but not yet to guest.
cancel_injection removes such exception from VMCS. It is set as
pending, and if the user does KVM_SET_REGS, it gets completely canceled.

This didn't happen though, because the vcpu->arch.exception.injected
and vcpu->arch.exception.pending were forgotten to update in
cancel_injection. As the result, KVM_SET_REGS didn't cancel out
anything, and the exception was re-injected on the next KVM_RUN,
even though the guest registers (like EIP) were already modified.
This was leading to an exception coming from the "wrong place".

This patch makes sure the vcpu->arch.exception.injected and
vcpu->arch.exception.pending are in sync with the reality (and
with VMCS). Also it adds clearing of pending exception to
__set_sregs() the same way it is in __set_regs(). See patch
b4f14abd9 that added it to __set_regs().

How to trigger the buggy scenario (that is, without this patch):
- Make sure you have the old CPU where shadow page tables
are used. Core2 family should be fine for the task. In this
case, all PF exceptions produce the exit to monitor.
- You need the _TIF_SIGPENDING flag set at the right moment
to get kvm_vcpu_exit_request() to return true when the PF
exception was just injected. In that case the cancel_injection
path is executed.
- You need the "unlucky" user-space that executes KVM_SET_REGS
at the right moment. This leads to KVM_SET_REGS not clearing
the exception, but instead corrupting its context.

v2 changes:
- do not add WARN_ON_ONCE() to __set_regs(). As explained by
Vitaly Kuznetsov, it can be user-triggerable.
- clear pending exception also in __set_sregs().
- update description with the bug-triggering scenario.

Signed-off-by: Stas Sergeev <stsp2@yandex.ru>

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Sean Christopherson <seanjc@google.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Jim Mattson <jmattson@google.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Jan Kiszka <jan.kiszka@siemens.com>
CC: x86@kernel.org
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: kvm@vger.kernel.org
---
 arch/x86/kvm/x86.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Maxim Levitsky June 28, 2021, 2:29 p.m. UTC | #1
On Mon, 2021-06-28 at 15:48 +0300, Stas Sergeev wrote:
> When returning to user, the special care is taken about the
> exception that was already injected to VMCS but not yet to guest.
> cancel_injection removes such exception from VMCS. It is set as
> pending, and if the user does KVM_SET_REGS, it gets completely canceled.
> 
> This didn't happen though, because the vcpu->arch.exception.injected
> and vcpu->arch.exception.pending were forgotten to update in
> cancel_injection. As the result, KVM_SET_REGS didn't cancel out
> anything, and the exception was re-injected on the next KVM_RUN,
> even though the guest registers (like EIP) were already modified.
> This was leading to an exception coming from the "wrong place".
> 
> This patch makes sure the vcpu->arch.exception.injected and
> vcpu->arch.exception.pending are in sync with the reality (and
> with VMCS). Also it adds clearing of pending exception to
> __set_sregs() the same way it is in __set_regs(). See patch
> b4f14abd9 that added it to __set_regs().
> 
> How to trigger the buggy scenario (that is, without this patch):
> - Make sure you have the old CPU where shadow page tables
> are used. Core2 family should be fine for the task. In this
> case, all PF exceptions produce the exit to monitor.
> - You need the _TIF_SIGPENDING flag set at the right moment
> to get kvm_vcpu_exit_request() to return true when the PF
> exception was just injected. In that case the cancel_injection
> path is executed.
> - You need the "unlucky" user-space that executes KVM_SET_REGS
> at the right moment. This leads to KVM_SET_REGS not clearing
> the exception, but instead corrupting its context.
> 
> v2 changes:
> - do not add WARN_ON_ONCE() to __set_regs(). As explained by
> Vitaly Kuznetsov, it can be user-triggerable.
> - clear pending exception also in __set_sregs().
> - update description with the bug-triggering scenario.

I used to know that area very very well, and I basically combed
the whole thing back and forth, 
and I have patch series to decouple injected and
pending exceptions. 

I'll refresh my memory on this and then I'll review your patch.

My gut feeling is that you discovered too that injections of
exceptions from userspace is kind of broken and only works
because Qemu doesn't really inject much (only some #MC exceptions
which are mostly fatal anyway, and #DB/#BP which only happen
when guest debugging is enabled which is not a standard feature).

Best regards,
	Maxim Levitsky

> 
> Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
> 
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Sean Christopherson <seanjc@google.com>
> CC: Vitaly Kuznetsov <vkuznets@redhat.com>
> CC: Jim Mattson <jmattson@google.com>
> CC: Joerg Roedel <joro@8bytes.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: Jan Kiszka <jan.kiszka@siemens.com>
> CC: x86@kernel.org
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: kvm@vger.kernel.org
> ---
>  arch/x86/kvm/x86.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e0f4a46649d7..d1026e9216e4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9450,7 +9450,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  cancel_injection:
>  	if (req_immediate_exit)
>  		kvm_make_request(KVM_REQ_EVENT, vcpu);
> -	static_call(kvm_x86_cancel_injection)(vcpu);
> +	if (vcpu->arch.exception.injected) {
> +		static_call(kvm_x86_cancel_injection)(vcpu);
> +		vcpu->arch.exception.injected = false;
> +		vcpu->arch.exception.pending = true;
> +	}
>  	if (unlikely(vcpu->arch.apic_attention))
>  		kvm_lapic_sync_from_vapic(vcpu);
>  out:
> @@ -10077,6 +10081,8 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
>  		pr_debug("Set back pending irq %d\n", pending_vec);
>  	}
>  
> +	vcpu->arch.exception.pending = false;
> +
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
>  	ret = 0;
Stas Sergeev June 28, 2021, 3:09 p.m. UTC | #2
28.06.2021 17:29, Maxim Levitsky пишет:
> I used to know that area very very well, and I basically combed
> the whole thing back and forth,
> and I have patch series to decouple injected and
> pending exceptions.

Yes, and also I dislike the fact
that you can't easily distinguish
the exception injected from
user-space, with the PF coming
from the guest itself. They occupy
the same pending/injected fields.
Some refactoring will definitely
not hurt.


> I'll refresh my memory on this and then I'll review your patch.
>
> My gut feeling is that you discovered too that injections of
> exceptions from userspace is kind of broken and only works
> because Qemu doesn't really inject much

Actually I discovered that injecting
_interrupts_ is kinda broken (on Core2),
because they clash with guest's PF.
Maybe if I would be using KVM-supplied
injection APIs, I would avoid the problem.
But I just use KVM_SET_REGS to inject
the interrupt, which perhaps qemu doesn't
do.
Jim Mattson June 28, 2021, 4:50 p.m. UTC | #3
On Mon, Jun 28, 2021 at 8:09 AM stsp <stsp2@yandex.ru> wrote:
>
> 28.06.2021 17:29, Maxim Levitsky пишет:
> > I used to know that area very very well, and I basically combed
> > the whole thing back and forth,
> > and I have patch series to decouple injected and
> > pending exceptions.
>
> Yes, and also I dislike the fact
> that you can't easily distinguish
> the exception injected from
> user-space, with the PF coming
> from the guest itself. They occupy
> the same pending/injected fields.
> Some refactoring will definitely
> not hurt.
>
>
> > I'll refresh my memory on this and then I'll review your patch.
> >
> > My gut feeling is that you discovered too that injections of
> > exceptions from userspace is kind of broken and only works
> > because Qemu doesn't really inject much
>
> Actually I discovered that injecting
> _interrupts_ is kinda broken (on Core2),
> because they clash with guest's PF.
> Maybe if I would be using KVM-supplied
> injection APIs, I would avoid the problem.
> But I just use KVM_SET_REGS to inject
> the interrupt, which perhaps qemu doesn't
> do.
>

I don't know how you can inject an interrupt with KVM_SET_REGS, but I
suspect that you're doing something wrong. :-)

 If I wanted to inject an interrupt from userspace, I would use
KVM_SET_LAPIC (assuming that the local APIC is active) to set the
appropriate bit in IRRV. Before you can deliver an interrupt, you have
to check the local APIC anyway, to see whether or not your interrupt
would be blocked by PPR.
Stas Sergeev June 28, 2021, 5 p.m. UTC | #4
28.06.2021 19:50, Jim Mattson пишет:
> I don't know how you can inject an interrupt with KVM_SET_REGS, but I
> suspect that you're doing something wrong. :-)
>
>   If I wanted to inject an interrupt from userspace, I would use
> KVM_SET_LAPIC (assuming that the local APIC is active) to set the
> appropriate bit in IRRV. Before you can deliver an interrupt, you have
> to check the local APIC anyway, to see whether or not your interrupt
> would be blocked by PPR.

I do not use any of the emulated
HW in KVM. PIC is in user-space,
and no apic. This is a very simple
hypervisor for running the ring3
code only.
So to inject an interrupt, I create
a stack frame and set eflags/cs/eip
to the needed values, and that's all.
Just as people did in a pre-KVM era. :)
Maxim Levitsky July 6, 2021, 11:49 a.m. UTC | #5
On Mon, 2021-06-28 at 17:29 +0300, Maxim Levitsky wrote:
> On Mon, 2021-06-28 at 15:48 +0300, Stas Sergeev wrote:
> > When returning to user, the special care is taken about the
> > exception that was already injected to VMCS but not yet to guest.
> > cancel_injection removes such exception from VMCS. It is set as
> > pending, and if the user does KVM_SET_REGS, it gets completely canceled.
> > 
> > This didn't happen though, because the vcpu->arch.exception.injected
> > and vcpu->arch.exception.pending were forgotten to update in
> > cancel_injection. As the result, KVM_SET_REGS didn't cancel out
> > anything, and the exception was re-injected on the next KVM_RUN,
> > even though the guest registers (like EIP) were already modified.
> > This was leading to an exception coming from the "wrong place".
> > 
> > This patch makes sure the vcpu->arch.exception.injected and
> > vcpu->arch.exception.pending are in sync with the reality (and
> > with VMCS). Also it adds clearing of pending exception to
> > __set_sregs() the same way it is in __set_regs(). See patch
> > b4f14abd9 that added it to __set_regs().
> > 
> > How to trigger the buggy scenario (that is, without this patch):
> > - Make sure you have the old CPU where shadow page tables
> > are used. Core2 family should be fine for the task. In this
> > case, all PF exceptions produce the exit to monitor.
> > - You need the _TIF_SIGPENDING flag set at the right moment
> > to get kvm_vcpu_exit_request() to return true when the PF
> > exception was just injected. In that case the cancel_injection
> > path is executed.
> > - You need the "unlucky" user-space that executes KVM_SET_REGS
> > at the right moment. This leads to KVM_SET_REGS not clearing
> > the exception, but instead corrupting its context.
> > 
> > v2 changes:
> > - do not add WARN_ON_ONCE() to __set_regs(). As explained by
> > Vitaly Kuznetsov, it can be user-triggerable.
> > - clear pending exception also in __set_sregs().
> > - update description with the bug-triggering scenario.
> 
> I used to know that area very very well, and I basically combed
> the whole thing back and forth, 
> and I have patch series to decouple injected and
> pending exceptions. 
> 
> I'll refresh my memory on this and then I'll review your patch.

Hi!
Sorry for the delayed response.
 
First of all indeed an exception can be either in pending or injected state.
A pending exception is an exception KVM created but that didn't yet change
the guest state. 
 
It still has to be injected to the guest on its current instruction,
thus its not like a pending interrupt.
 
An injected exception is an exception which was already injected to the guest
(or at least attempted an injection but that was aborted)

Since the exception payload is injected prior to pushing the error
code on the stack and then jumping to the exception handler,
the guest state is already modified.

There can be 2 reasons why exception delivery is aborted
like that:
 
1. By KVM (that is cancel_injection), when KVM needs to exit
to userspace due to a signal
 
2. By the CPU, when CPU detects another exception or paging fault,
while delivering this exception.
 
In both cases the exception stays in the 'injected' state, and
has to be injected again (that is why I don't really like the
'injected' term, since it is more like 'injection_started').

Now about the KVM's userspace API where this is exposed:
 
I see now too that KVM_SET_REGS clears the pending exception.
This is new to me and it is IMHO *wrong* thing to do.
However I bet that someone somewhere depends on this,
since this behavior is very old.
 
This doesn't affect qemu since when it does KVM_SET_REGS,
it also does KVM_SET_VCPU_EVENTS afterward, and that 
restores either pending or injected exception state.
(that state is first read with KVM_GET_VCPU_EVENTS ioctl).
 
Also note just for reference that KVM_SET_SREGS has ability
to set a pending interrupt, something that is also redundant
since KVM_SET_VCPU_EVENTS does this.
I recently added KVM_SET_SREGS2 ioctl which now lacks this
ability.

KVM_SET_VCPU_EVENTS/KVM_GET_VCPU_EVENTS allows to read/write
state of both pending and injected exception and on top of that
allows to read/write the exception payload of a pending exception.

You should consider using it to preserve/modify the exception state,
although the later isn't recommended (qemu does this in few places,
but it is a bit buggy, especially in regard to nested guests).

Best regards,
	Maxim Levitsky
Stas Sergeev July 6, 2021, 12:06 p.m. UTC | #6
06.07.2021 14:49, Maxim Levitsky пишет:
> Now about the KVM's userspace API where this is exposed:
>   
> I see now too that KVM_SET_REGS clears the pending exception.
> This is new to me and it is IMHO *wrong* thing to do.
> However I bet that someone somewhere depends on this,
> since this behavior is very old.

What alternative would you suggest?
Check for ready_for_interrupt_injection
and never call KVM_SET_REGS if it indicates
"not ready"?
But what if someone calls it nevertheless?
Perhaps return an error from KVM_SET_REGS
if exception is pending? Also KVM_SET_SREGS
needs some treatment here too, as it can
also be called when an exception is pending,
leading to problems.


> This doesn't affect qemu since when it does KVM_SET_REGS,
> it also does KVM_SET_VCPU_EVENTS afterward, and that
> restores either pending or injected exception state.
> (that state is first read with KVM_GET_VCPU_EVENTS ioctl).
>   
> Also note just for reference that KVM_SET_SREGS has ability
> to set a pending interrupt, something that is also redundant
> since KVM_SET_VCPU_EVENTS does this.
> I recently added KVM_SET_SREGS2 ioctl which now lacks this
> ability.
>
> KVM_SET_VCPU_EVENTS/KVM_GET_VCPU_EVENTS allows to read/write
> state of both pending and injected exception and on top of that
> allows to read/write the exception payload of a pending exception.
>
> You should consider using it to preserve/modify the exception state,
> although the later isn't recommended (qemu does this in few places,
> but it is a bit buggy, especially in regard to nested guests).
I need neither to preserve nor modify
the exception state. All I need is to safely
call KVM_SET_REGS/KVM_SET_SREGS, but
that appears unsafe when exception is
pending.

Please take a look into this commit:
https://www.lkml.org/lkml/2020/12/1/324

It was suggested that the removal
of "!kvm_event_needs_reinjection(vcpu)"
condition from kvm_vcpu_ready_for_interrupt_injection()
is what introduced the problem, as
right now ready_for_interrupt_injection
doesn't account for pending exception.
I already added the check to my
user-space code, and now it works
reliably on some very old kernels
prior to the aforementioned patch.
So should we treat that as a regressions,
or any other proposal?
Maxim Levitsky July 6, 2021, 8:29 p.m. UTC | #7
On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
> 06.07.2021 14:49, Maxim Levitsky пишет:
> > Now about the KVM's userspace API where this is exposed:
> >   
> > I see now too that KVM_SET_REGS clears the pending exception.
> > This is new to me and it is IMHO *wrong* thing to do.
> > However I bet that someone somewhere depends on this,
> > since this behavior is very old.
> 
> What alternative would you suggest?
> Check for ready_for_interrupt_injection
> and never call KVM_SET_REGS if it indicates
> "not ready"?
> But what if someone calls it nevertheless?
> Perhaps return an error from KVM_SET_REGS
> if exception is pending? Also KVM_SET_SREGS
> needs some treatment here too, as it can
> also be called when an exception is pending,
> leading to problems.

As I explained you can call KVM_GET_VCPU_EVENTS before calling
KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
that was filled by KVM_GET_VCPU_EVENTS.
That will preserve all the cpu events.

> 
> 
> > This doesn't affect qemu since when it does KVM_SET_REGS,
> > it also does KVM_SET_VCPU_EVENTS afterward, and that
> > restores either pending or injected exception state.
> > (that state is first read with KVM_GET_VCPU_EVENTS ioctl).
> >   
> > Also note just for reference that KVM_SET_SREGS has ability
> > to set a pending interrupt, something that is also redundant
> > since KVM_SET_VCPU_EVENTS does this.
> > I recently added KVM_SET_SREGS2 ioctl which now lacks this
> > ability.
> > 
> > KVM_SET_VCPU_EVENTS/KVM_GET_VCPU_EVENTS allows to read/write
> > state of both pending and injected exception and on top of that
> > allows to read/write the exception payload of a pending exception.
> > 
> > You should consider using it to preserve/modify the exception state,
> > although the later isn't recommended (qemu does this in few places,
> > but it is a bit buggy, especially in regard to nested guests).
> I need neither to preserve nor modify
> the exception state. All I need is to safely
> call KVM_SET_REGS/KVM_SET_SREGS, but
> that appears unsafe when exception is
> pending.
> 
> Please take a look into this commit:
> https://www.lkml.org/lkml/2020/12/1/324
> 
> It was suggested that the removal
> of "!kvm_event_needs_reinjection(vcpu)"
> condition from kvm_vcpu_ready_for_interrupt_injection()
> is what introduced the problem, as
> right now ready_for_interrupt_injection
> doesn't account for pending exception.
> I already added the check to my
> user-space code, and now it works
> reliably on some very old kernels
> prior to the aforementioned patch.
> So should we treat that as a regressions,
> or any other proposal?

I haven't studied the userspace interrupt injection code much but
it does sound like if we signal to userspace that irq window is
open, that means that indeed there must be no injected interrupts/exceptions.

Now let me explain how nesting of events supposed to work on real hardware:

Event nesting happens when during delivery of an event (interrupt or exception)
we got another event (it has to be exception practically, and it happens
only when the delivery process (like switching stacks, reading IDT,
pushing stuff to the stack, etc causes an exception)

There are 4 combinations of an event, and event that happened  during delivery
of it.


1. Exception->Exception:
There is an  exception, and during delivery of it we got an exception
(like #PF on accessing GDT, or #NP, #SS, or something like that)
In this case exceptions are merged which means that they are either converted to 
#DF,  first exception is lost, or we get a triple fault (SDM has a table
for all cases).


2. Interrupt->Exception
If there is an injected interrupt and during delivery of it, we get an exception
this means that jump to the interrupt handler caused an exception,
in which case the interrupt is lost and we run the exception handler instead.


3. Interrupt->Interrupt
4. Exception->Interrupt.

Those two cases can't happen, as interrupts should not be processed while delivering
an exception/interrupt.

Therefore indeed if we signal the userspace that interrupt window is open,
that means that there must be no injected interrupt/exception.

If the userspace however wants to inject an exception, currently it
can only correctly do so if it uses KVM_GET_VCPU_EVENTS, to see if we
already have a injected exception there and then merge itself new
exception. And that still doesn't work correctly when nested guests
are involved. As I said the userspace exception injection via KVM_GET_VCPU_EVENTS
is kind of broken, and only works in qemu since it uses it very rarely.

Paolo, what do you think?


Best regards,
	Maxim Levitsky
Stas Sergeev July 6, 2021, 9:50 p.m. UTC | #8
06.07.2021 23:29, Maxim Levitsky пишет:
> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
>> 06.07.2021 14:49, Maxim Levitsky пишет:
>>> Now about the KVM's userspace API where this is exposed:
>>>    
>>> I see now too that KVM_SET_REGS clears the pending exception.
>>> This is new to me and it is IMHO *wrong* thing to do.
>>> However I bet that someone somewhere depends on this,
>>> since this behavior is very old.
>> What alternative would you suggest?
>> Check for ready_for_interrupt_injection
>> and never call KVM_SET_REGS if it indicates
>> "not ready"?
>> But what if someone calls it nevertheless?
>> Perhaps return an error from KVM_SET_REGS
>> if exception is pending? Also KVM_SET_SREGS
>> needs some treatment here too, as it can
>> also be called when an exception is pending,
>> leading to problems.
> As I explained you can call KVM_GET_VCPU_EVENTS before calling
> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
> that was filled by KVM_GET_VCPU_EVENTS.
> That will preserve all the cpu events.

The question is different.
I wonder how _should_ the KVM
API behave when someone calls
KVM_SET_REGS/KVM_SET_SREGS
while the exception is pending.
This is currently not handled properly.
We can add/fix the indication with
ready_for_interrupt_injection,
but someone will ignore that
indication, so some handling
(like returning an error) should
be added.
So what would you propose the
KVM_SET_REGS should do if it is
called when an exception is pending?
The question is here because
currently KVM_SET_REGS and
KVM_SET_SREGS handle that differently:
one is trying to cancel the pending
excpetion, and the other one
does nothing, but both are wrong.
Maxim Levitsky July 6, 2021, 11 p.m. UTC | #9
On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
> 06.07.2021 23:29, Maxim Levitsky пишет:
> > On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
> > > 06.07.2021 14:49, Maxim Levitsky пишет:
> > > > Now about the KVM's userspace API where this is exposed:
> > > >    
> > > > I see now too that KVM_SET_REGS clears the pending exception.
> > > > This is new to me and it is IMHO *wrong* thing to do.
> > > > However I bet that someone somewhere depends on this,
> > > > since this behavior is very old.
> > > What alternative would you suggest?
> > > Check for ready_for_interrupt_injection
> > > and never call KVM_SET_REGS if it indicates
> > > "not ready"?
> > > But what if someone calls it nevertheless?
> > > Perhaps return an error from KVM_SET_REGS
> > > if exception is pending? Also KVM_SET_SREGS
> > > needs some treatment here too, as it can
> > > also be called when an exception is pending,
> > > leading to problems.
> > As I explained you can call KVM_GET_VCPU_EVENTS before calling
> > KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
> > that was filled by KVM_GET_VCPU_EVENTS.
> > That will preserve all the cpu events.
> 
> The question is different.
> I wonder how _should_ the KVM
> API behave when someone calls
> KVM_SET_REGS/KVM_SET_SREGS

KVM_SET_REGS should not clear the pending exception.
but fixing this can break API compatibilitly if some
hypervisor (not qemu) relies on it.

Thus either a new ioctl is needed or as I said,
KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
to preserve the events around that call as workaround.

Best regards,
	Maxim Levitsky


> while the exception is pending.
> This is currently not handled properly.
> We can add/fix the indication with
> ready_for_interrupt_injection,
> but someone will ignore that
> indication, so some handling
> (like returning an error) should
> be added.
> So what would you propose the
> KVM_SET_REGS should do if it is
> called when an exception is pending?


> The question is here because
> currently KVM_SET_REGS and
> KVM_SET_SREGS handle that differently:
> one is trying to cancel the pending
> excpetion, and the other one
> does nothing, but both are wrong.
>
Stas Sergeev July 6, 2021, 11:06 p.m. UTC | #10
07.07.2021 02:00, Maxim Levitsky пишет:
> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
>> 06.07.2021 23:29, Maxim Levitsky пишет:
>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
>>>> 06.07.2021 14:49, Maxim Levitsky пишет:
>>>>> Now about the KVM's userspace API where this is exposed:
>>>>>     
>>>>> I see now too that KVM_SET_REGS clears the pending exception.
>>>>> This is new to me and it is IMHO *wrong* thing to do.
>>>>> However I bet that someone somewhere depends on this,
>>>>> since this behavior is very old.
>>>> What alternative would you suggest?
>>>> Check for ready_for_interrupt_injection
>>>> and never call KVM_SET_REGS if it indicates
>>>> "not ready"?
>>>> But what if someone calls it nevertheless?
>>>> Perhaps return an error from KVM_SET_REGS
>>>> if exception is pending? Also KVM_SET_SREGS
>>>> needs some treatment here too, as it can
>>>> also be called when an exception is pending,
>>>> leading to problems.
>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling
>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
>>> that was filled by KVM_GET_VCPU_EVENTS.
>>> That will preserve all the cpu events.
>> The question is different.
>> I wonder how _should_ the KVM
>> API behave when someone calls
>> KVM_SET_REGS/KVM_SET_SREGS
> KVM_SET_REGS should not clear the pending exception.
> but fixing this can break API compatibilitly if some
> hypervisor (not qemu) relies on it.
>
> Thus either a new ioctl is needed or as I said,
> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
> to preserve the events around that call as workaround.
But I don't need to preserve
events. Canceling is perfectly
fine with me because, if I inject
the interrupt at that point, the
exception will be re-triggered
anyway after interrupt handler
returns.

What I ask is how SHOULD the
KVM_SET_REGS and KVM_SET_SREGS
behave when someone (mistakenly)
calls them with the exception pending.
Should they return an error
instead of canceling exception?
Paolo Bonzini July 6, 2021, 11:45 p.m. UTC | #11
On 07/07/21 01:06, stsp wrote:
> What I ask is how SHOULD the
> KVM_SET_REGS and KVM_SET_SREGS
> behave when someone (mistakenly)
> calls them with the exception pending.
> Should they return an error
> instead of canceling exception?

In theory, KVM_SET_REGS and KVM_SET_SREGS should do nothing but set the 
value of the registers.  They not should clear either 
vcpu->arch.exception.pending or vcpu->arch.exception.injected.  I'm wary 
of changing that and breaking users of KVM, though.

In this case the problem is that, with a pending exception, you should 
not inject the interrupt (doesn't matter if it's with KVM_SET_REGS or 
KVM_INTERRUPT).  Raising a page fault is part of executing the previous 
instruction, and interrupts are only recognized at instruction 
boundaries.  Therefore, you need to test ready_for_interrupt_injection, 
and possibly use request_interrupt_window, before calling KVM_SET_REGS.

The patch you identified as the culprit does have a bug, but that's 
fixed in kvm_cpu_accept_dm_intr as I suggested in the other thread.

Paolo
Stas Sergeev July 6, 2021, 11:51 p.m. UTC | #12
07.07.2021 02:45, Paolo Bonzini пишет:
> On 07/07/21 01:06, stsp wrote:
>> What I ask is how SHOULD the
>> KVM_SET_REGS and KVM_SET_SREGS
>> behave when someone (mistakenly)
>> calls them with the exception pending.
>> Should they return an error
>> instead of canceling exception?
>
> In theory, KVM_SET_REGS and KVM_SET_SREGS should do nothing but set 
> the value of the registers.  They not should clear either 
> vcpu->arch.exception.pending or vcpu->arch.exception.injected.
Maybe they should return an
error, or do something else to
alert the user? Otherwise the
100% wrong usage gets unnoticed.
Paolo Bonzini July 7, 2021, 12:13 a.m. UTC | #13
On 07/07/21 01:51, stsp wrote:
> 07.07.2021 02:45, Paolo Bonzini пишет:
>> On 07/07/21 01:06, stsp wrote:
>>> What I ask is how SHOULD the
>>> KVM_SET_REGS and KVM_SET_SREGS
>>> behave when someone (mistakenly)
>>> calls them with the exception pending.
>>> Should they return an error
>>> instead of canceling exception?
>>
>> In theory, KVM_SET_REGS and KVM_SET_SREGS should do nothing but set 
>> the value of the registers.  They not should clear either 
>> vcpu->arch.exception.pending or vcpu->arch.exception.injected.
> Maybe they should return an
> error, or do something else to
> alert the user? Otherwise the
> 100% wrong usage gets unnoticed.

No, setting the registers means that: setting the registers.  If you set 
RSP, the new value will be used for delivering the exception.

Paolo
Jim Mattson July 7, 2021, 4:16 p.m. UTC | #14
On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote:
>
> 07.07.2021 02:00, Maxim Levitsky пишет:
> > On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
> >> 06.07.2021 23:29, Maxim Levitsky пишет:
> >>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
> >>>> 06.07.2021 14:49, Maxim Levitsky пишет:
> >>>>> Now about the KVM's userspace API where this is exposed:
> >>>>>
> >>>>> I see now too that KVM_SET_REGS clears the pending exception.
> >>>>> This is new to me and it is IMHO *wrong* thing to do.
> >>>>> However I bet that someone somewhere depends on this,
> >>>>> since this behavior is very old.
> >>>> What alternative would you suggest?
> >>>> Check for ready_for_interrupt_injection
> >>>> and never call KVM_SET_REGS if it indicates
> >>>> "not ready"?
> >>>> But what if someone calls it nevertheless?
> >>>> Perhaps return an error from KVM_SET_REGS
> >>>> if exception is pending? Also KVM_SET_SREGS
> >>>> needs some treatment here too, as it can
> >>>> also be called when an exception is pending,
> >>>> leading to problems.
> >>> As I explained you can call KVM_GET_VCPU_EVENTS before calling
> >>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
> >>> that was filled by KVM_GET_VCPU_EVENTS.
> >>> That will preserve all the cpu events.
> >> The question is different.
> >> I wonder how _should_ the KVM
> >> API behave when someone calls
> >> KVM_SET_REGS/KVM_SET_SREGS
> > KVM_SET_REGS should not clear the pending exception.
> > but fixing this can break API compatibilitly if some
> > hypervisor (not qemu) relies on it.
> >
> > Thus either a new ioctl is needed or as I said,
> > KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
> > to preserve the events around that call as workaround.
> But I don't need to preserve
> events. Canceling is perfectly
> fine with me because, if I inject
> the interrupt at that point, the
> exception will be re-triggered
> anyway after interrupt handler
> returns.

The exception will not be re-triggered if it was a trap, because the
instruction that caused the exception will already have retired, and
RIP will have advanced.

Moreover, if the exception had any side-effects (#PF and sometimes
#DB), those side-effects have already happened, so if IDT vectoring is
cancelled, the vCPU is left in an architecturally unreachable state.

> What I ask is how SHOULD the
> KVM_SET_REGS and KVM_SET_SREGS
> behave when someone (mistakenly)
> calls them with the exception pending.
> Should they return an error
> instead of canceling exception?
Stas Sergeev July 7, 2021, 4:34 p.m. UTC | #15
07.07.2021 19:16, Jim Mattson пишет:
> On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote:
>> 07.07.2021 02:00, Maxim Levitsky пишет:
>>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
>>>> 06.07.2021 23:29, Maxim Levitsky пишет:
>>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
>>>>>> 06.07.2021 14:49, Maxim Levitsky пишет:
>>>>>>> Now about the KVM's userspace API where this is exposed:
>>>>>>>
>>>>>>> I see now too that KVM_SET_REGS clears the pending exception.
>>>>>>> This is new to me and it is IMHO *wrong* thing to do.
>>>>>>> However I bet that someone somewhere depends on this,
>>>>>>> since this behavior is very old.
>>>>>> What alternative would you suggest?
>>>>>> Check for ready_for_interrupt_injection
>>>>>> and never call KVM_SET_REGS if it indicates
>>>>>> "not ready"?
>>>>>> But what if someone calls it nevertheless?
>>>>>> Perhaps return an error from KVM_SET_REGS
>>>>>> if exception is pending? Also KVM_SET_SREGS
>>>>>> needs some treatment here too, as it can
>>>>>> also be called when an exception is pending,
>>>>>> leading to problems.
>>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling
>>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
>>>>> that was filled by KVM_GET_VCPU_EVENTS.
>>>>> That will preserve all the cpu events.
>>>> The question is different.
>>>> I wonder how _should_ the KVM
>>>> API behave when someone calls
>>>> KVM_SET_REGS/KVM_SET_SREGS
>>> KVM_SET_REGS should not clear the pending exception.
>>> but fixing this can break API compatibilitly if some
>>> hypervisor (not qemu) relies on it.
>>>
>>> Thus either a new ioctl is needed or as I said,
>>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
>>> to preserve the events around that call as workaround.
>> But I don't need to preserve
>> events. Canceling is perfectly
>> fine with me because, if I inject
>> the interrupt at that point, the
>> exception will be re-triggered
>> anyway after interrupt handler
>> returns.
> The exception will not be re-triggered if it was a trap,
But my assumption was that
everything is atomic, except
PF with shadow page tables.
I guess you mean the cases
when the exception delivery
causes EPT fault, which is a bit
of a corner case.
Jim Mattson July 7, 2021, 4:46 p.m. UTC | #16
On Wed, Jul 7, 2021 at 9:34 AM stsp <stsp2@yandex.ru> wrote:
>
> 07.07.2021 19:16, Jim Mattson пишет:
> > On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote:
> >> 07.07.2021 02:00, Maxim Levitsky пишет:
> >>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
> >>>> 06.07.2021 23:29, Maxim Levitsky пишет:
> >>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
> >>>>>> 06.07.2021 14:49, Maxim Levitsky пишет:
> >>>>>>> Now about the KVM's userspace API where this is exposed:
> >>>>>>>
> >>>>>>> I see now too that KVM_SET_REGS clears the pending exception.
> >>>>>>> This is new to me and it is IMHO *wrong* thing to do.
> >>>>>>> However I bet that someone somewhere depends on this,
> >>>>>>> since this behavior is very old.
> >>>>>> What alternative would you suggest?
> >>>>>> Check for ready_for_interrupt_injection
> >>>>>> and never call KVM_SET_REGS if it indicates
> >>>>>> "not ready"?
> >>>>>> But what if someone calls it nevertheless?
> >>>>>> Perhaps return an error from KVM_SET_REGS
> >>>>>> if exception is pending? Also KVM_SET_SREGS
> >>>>>> needs some treatment here too, as it can
> >>>>>> also be called when an exception is pending,
> >>>>>> leading to problems.
> >>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling
> >>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
> >>>>> that was filled by KVM_GET_VCPU_EVENTS.
> >>>>> That will preserve all the cpu events.
> >>>> The question is different.
> >>>> I wonder how _should_ the KVM
> >>>> API behave when someone calls
> >>>> KVM_SET_REGS/KVM_SET_SREGS
> >>> KVM_SET_REGS should not clear the pending exception.
> >>> but fixing this can break API compatibilitly if some
> >>> hypervisor (not qemu) relies on it.
> >>>
> >>> Thus either a new ioctl is needed or as I said,
> >>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
> >>> to preserve the events around that call as workaround.
> >> But I don't need to preserve
> >> events. Canceling is perfectly
> >> fine with me because, if I inject
> >> the interrupt at that point, the
> >> exception will be re-triggered
> >> anyway after interrupt handler
> >> returns.
> > The exception will not be re-triggered if it was a trap,
> But my assumption was that
> everything is atomic, except
> PF with shadow page tables.
> I guess you mean the cases
> when the exception delivery
> causes EPT fault, which is a bit
> of a corner case.

No, that's not what I mean. Consider the #DB exception, which is
intercepted in all configurations to circumvent a DoS attack. Some #DB
exceptions modify DR6. Once the exception has been 'injected,' DR6 has
already been modified. If you do not complete the injection, but you
deliver an interrupt instead, then the interrupt handler can see a DR6
value that is architecturally impossible.
Stas Sergeev July 7, 2021, 4:58 p.m. UTC | #17
07.07.2021 19:46, Jim Mattson пишет:
> On Wed, Jul 7, 2021 at 9:34 AM stsp <stsp2@yandex.ru> wrote:
>> 07.07.2021 19:16, Jim Mattson пишет:
>>> On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote:
>>>> 07.07.2021 02:00, Maxim Levitsky пишет:
>>>>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
>>>>>> 06.07.2021 23:29, Maxim Levitsky пишет:
>>>>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
>>>>>>>> 06.07.2021 14:49, Maxim Levitsky пишет:
>>>>>>>>> Now about the KVM's userspace API where this is exposed:
>>>>>>>>>
>>>>>>>>> I see now too that KVM_SET_REGS clears the pending exception.
>>>>>>>>> This is new to me and it is IMHO *wrong* thing to do.
>>>>>>>>> However I bet that someone somewhere depends on this,
>>>>>>>>> since this behavior is very old.
>>>>>>>> What alternative would you suggest?
>>>>>>>> Check for ready_for_interrupt_injection
>>>>>>>> and never call KVM_SET_REGS if it indicates
>>>>>>>> "not ready"?
>>>>>>>> But what if someone calls it nevertheless?
>>>>>>>> Perhaps return an error from KVM_SET_REGS
>>>>>>>> if exception is pending? Also KVM_SET_SREGS
>>>>>>>> needs some treatment here too, as it can
>>>>>>>> also be called when an exception is pending,
>>>>>>>> leading to problems.
>>>>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling
>>>>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
>>>>>>> that was filled by KVM_GET_VCPU_EVENTS.
>>>>>>> That will preserve all the cpu events.
>>>>>> The question is different.
>>>>>> I wonder how _should_ the KVM
>>>>>> API behave when someone calls
>>>>>> KVM_SET_REGS/KVM_SET_SREGS
>>>>> KVM_SET_REGS should not clear the pending exception.
>>>>> but fixing this can break API compatibilitly if some
>>>>> hypervisor (not qemu) relies on it.
>>>>>
>>>>> Thus either a new ioctl is needed or as I said,
>>>>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
>>>>> to preserve the events around that call as workaround.
>>>> But I don't need to preserve
>>>> events. Canceling is perfectly
>>>> fine with me because, if I inject
>>>> the interrupt at that point, the
>>>> exception will be re-triggered
>>>> anyway after interrupt handler
>>>> returns.
>>> The exception will not be re-triggered if it was a trap,
>> But my assumption was that
>> everything is atomic, except
>> PF with shadow page tables.
>> I guess you mean the cases
>> when the exception delivery
>> causes EPT fault, which is a bit
>> of a corner case.
> No, that's not what I mean. Consider the #DB exception, which is
> intercepted in all configurations to circumvent a DoS attack. Some #DB
> exceptions modify DR6. Once the exception has been 'injected,' DR6 has
> already been modified. If you do not complete the injection, but you
> deliver an interrupt instead, then the interrupt handler can see a DR6
> value that is architecturally impossible.

Yes, I understand that part.
It seems to be called the "exception
payload" in kvm sources, and
includes also CR2 for #PF.
So of course if there are many
non-atomic cases, rather than
just one, then there are no doubts
we need to check ready_for_injection.
Its just that I was looking at that
non-atomicity as a kvm's quirk,
but its probably the fundamental
part of vmx instead.
There is still the problem that
KVM_SET_REGS cancels the
injection and KVM_SET_SREGS not.
But I realize you may want to leave
it that way for compatibility.
Jim Mattson July 7, 2021, 5:39 p.m. UTC | #18
On Wed, Jul 7, 2021 at 9:58 AM stsp <stsp2@yandex.ru> wrote:
>
> 07.07.2021 19:46, Jim Mattson пишет:
> > On Wed, Jul 7, 2021 at 9:34 AM stsp <stsp2@yandex.ru> wrote:
> >> 07.07.2021 19:16, Jim Mattson пишет:
> >>> On Tue, Jul 6, 2021 at 4:06 PM stsp <stsp2@yandex.ru> wrote:
> >>>> 07.07.2021 02:00, Maxim Levitsky пишет:
> >>>>> On Wed, 2021-07-07 at 00:50 +0300, stsp wrote:
> >>>>>> 06.07.2021 23:29, Maxim Levitsky пишет:
> >>>>>>> On Tue, 2021-07-06 at 15:06 +0300, stsp wrote:
> >>>>>>>> 06.07.2021 14:49, Maxim Levitsky пишет:
> >>>>>>>>> Now about the KVM's userspace API where this is exposed:
> >>>>>>>>>
> >>>>>>>>> I see now too that KVM_SET_REGS clears the pending exception.
> >>>>>>>>> This is new to me and it is IMHO *wrong* thing to do.
> >>>>>>>>> However I bet that someone somewhere depends on this,
> >>>>>>>>> since this behavior is very old.
> >>>>>>>> What alternative would you suggest?
> >>>>>>>> Check for ready_for_interrupt_injection
> >>>>>>>> and never call KVM_SET_REGS if it indicates
> >>>>>>>> "not ready"?
> >>>>>>>> But what if someone calls it nevertheless?
> >>>>>>>> Perhaps return an error from KVM_SET_REGS
> >>>>>>>> if exception is pending? Also KVM_SET_SREGS
> >>>>>>>> needs some treatment here too, as it can
> >>>>>>>> also be called when an exception is pending,
> >>>>>>>> leading to problems.
> >>>>>>> As I explained you can call KVM_GET_VCPU_EVENTS before calling
> >>>>>>> KVM_SET_REGS and then call KVM_SET_VCPU_EVENTS with the struct
> >>>>>>> that was filled by KVM_GET_VCPU_EVENTS.
> >>>>>>> That will preserve all the cpu events.
> >>>>>> The question is different.
> >>>>>> I wonder how _should_ the KVM
> >>>>>> API behave when someone calls
> >>>>>> KVM_SET_REGS/KVM_SET_SREGS
> >>>>> KVM_SET_REGS should not clear the pending exception.
> >>>>> but fixing this can break API compatibilitly if some
> >>>>> hypervisor (not qemu) relies on it.
> >>>>>
> >>>>> Thus either a new ioctl is needed or as I said,
> >>>>> KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS can be used
> >>>>> to preserve the events around that call as workaround.
> >>>> But I don't need to preserve
> >>>> events. Canceling is perfectly
> >>>> fine with me because, if I inject
> >>>> the interrupt at that point, the
> >>>> exception will be re-triggered
> >>>> anyway after interrupt handler
> >>>> returns.
> >>> The exception will not be re-triggered if it was a trap,
> >> But my assumption was that
> >> everything is atomic, except
> >> PF with shadow page tables.
> >> I guess you mean the cases
> >> when the exception delivery
> >> causes EPT fault, which is a bit
> >> of a corner case.
> > No, that's not what I mean. Consider the #DB exception, which is
> > intercepted in all configurations to circumvent a DoS attack. Some #DB
> > exceptions modify DR6. Once the exception has been 'injected,' DR6 has
> > already been modified. If you do not complete the injection, but you
> > deliver an interrupt instead, then the interrupt handler can see a DR6
> > value that is architecturally impossible.
>
> Yes, I understand that part.
> It seems to be called the "exception
> payload" in kvm sources, and
> includes also CR2 for #PF.
> So of course if there are many
> non-atomic cases, rather than
> just one, then there are no doubts
> we need to check ready_for_injection.
> Its just that I was looking at that
> non-atomicity as a kvm's quirk,
> but its probably the fundamental
> part of vmx instead.

You could argue that it's a Linux quirk. Without EINTR, this probably
wouldn't be an issue.

> There is still the problem that
> KVM_SET_REGS cancels the
> injection and KVM_SET_SREGS not.
> But I realize you may want to leave
> it that way for compatibility.

Sadly, broken userspace APIs often have to remain broken. The "fix" is
probably to introduce KVM_SET_REGS2.
diff mbox series

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e0f4a46649d7..d1026e9216e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9450,7 +9450,11 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 cancel_injection:
 	if (req_immediate_exit)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
-	static_call(kvm_x86_cancel_injection)(vcpu);
+	if (vcpu->arch.exception.injected) {
+		static_call(kvm_x86_cancel_injection)(vcpu);
+		vcpu->arch.exception.injected = false;
+		vcpu->arch.exception.pending = true;
+	}
 	if (unlikely(vcpu->arch.apic_attention))
 		kvm_lapic_sync_from_vapic(vcpu);
 out:
@@ -10077,6 +10081,8 @@  static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		pr_debug("Set back pending irq %d\n", pending_vec);
 	}
 
+	vcpu->arch.exception.pending = false;
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	ret = 0;