Message ID | 20210315221020.661693-3-mlevitsk@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: my debug patch queue | expand |
On Tue, Mar 16, 2021, Maxim Levitsky wrote: > This change greatly helps with two issues: > > * Resuming from a breakpoint is much more reliable. > > When resuming execution from a breakpoint, with interrupts enabled, more often > than not, KVM would inject an interrupt and make the CPU jump immediately to > the interrupt handler and eventually return to the breakpoint, to trigger it > again. > > From the user point of view it looks like the CPU never executed a > single instruction and in some cases that can even prevent forward progress, > for example, when the breakpoint is placed by an automated script > (e.g lx-symbols), which does something in response to the breakpoint and then > continues the guest automatically. > If the script execution takes enough time for another interrupt to arrive, > the guest will be stuck on the same breakpoint RIP forever. > > * Normal single stepping is much more predictable, since it won't land the > debugger into an interrupt handler, so it is much more usable. > > (If entry to an interrupt handler is desired, the user can still place a > breakpoint at it and resume the guest, which won't activate this workaround > and let the gdb still stop at the interrupt handler) > > Since this change is only active when guest is debugged, it won't affect > KVM running normal 'production' VMs. > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> > Tested-by: Stefano Garzarella <sgarzare@redhat.com> > --- > arch/x86/kvm/x86.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index a9d95f90a0487..b75d990fcf12b 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit > can_inject = false; > } > > + /* > + * Don't inject interrupts while single stepping to make guest debug easier > + */ > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) > + return; Is this something userspace can deal with? E.g. disable IRQs and/or set NMI blocking at the start of single-stepping, unwind at the end? Deviating this far from architectural behavior will end in tears at some point. > + > /* > * Finally, inject interrupt events. If an event cannot be injected > * due to architectural conditions (e.g. IF=0) a window-open exit > -- > 2.26.2 >
On 16.03.21 00:37, Sean Christopherson wrote: > On Tue, Mar 16, 2021, Maxim Levitsky wrote: >> This change greatly helps with two issues: >> >> * Resuming from a breakpoint is much more reliable. >> >> When resuming execution from a breakpoint, with interrupts enabled, more often >> than not, KVM would inject an interrupt and make the CPU jump immediately to >> the interrupt handler and eventually return to the breakpoint, to trigger it >> again. >> >> From the user point of view it looks like the CPU never executed a >> single instruction and in some cases that can even prevent forward progress, >> for example, when the breakpoint is placed by an automated script >> (e.g lx-symbols), which does something in response to the breakpoint and then >> continues the guest automatically. >> If the script execution takes enough time for another interrupt to arrive, >> the guest will be stuck on the same breakpoint RIP forever. >> >> * Normal single stepping is much more predictable, since it won't land the >> debugger into an interrupt handler, so it is much more usable. >> >> (If entry to an interrupt handler is desired, the user can still place a >> breakpoint at it and resume the guest, which won't activate this workaround >> and let the gdb still stop at the interrupt handler) >> >> Since this change is only active when guest is debugged, it won't affect >> KVM running normal 'production' VMs. >> >> >> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> >> Tested-by: Stefano Garzarella <sgarzare@redhat.com> >> --- >> arch/x86/kvm/x86.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index a9d95f90a0487..b75d990fcf12b 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit >> can_inject = false; >> } >> >> + /* >> + * Don't inject interrupts while single stepping to make guest debug easier >> + */ >> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) >> + return; > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI > blocking at the start of single-stepping, unwind at the end? Deviating this far > from architectural behavior will end in tears at some point. > Does this happen to address this suspicious workaround in the kernel? /* * The kernel doesn't use TF single-step outside of: * * - Kprobes, consumed through kprobe_debug_handler() * - KGDB, consumed through notify_debug() * * So if we get here with DR_STEP set, something is wonky. * * A known way to trigger this is through QEMU's GDB stub, * which leaks #DB into the guest and causes IST recursion. */ if (WARN_ON_ONCE(dr6 & DR_STEP)) regs->flags &= ~X86_EFLAGS_TF; (arch/x86/kernel/traps.c, exc_debug_kernel) I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, yeah, question to myself as well, dancing around broken guest debugging for a long time while trying to fix other issues... Jan >> + >> /* >> * Finally, inject interrupt events. If an event cannot be injected >> * due to architectural conditions (e.g. IF=0) a window-open exit >> -- >> 2.26.2 >>
On Mon, 2021-03-15 at 16:37 -0700, Sean Christopherson wrote: > On Tue, Mar 16, 2021, Maxim Levitsky wrote: > > This change greatly helps with two issues: > > > > * Resuming from a breakpoint is much more reliable. > > > > When resuming execution from a breakpoint, with interrupts enabled, more often > > than not, KVM would inject an interrupt and make the CPU jump immediately to > > the interrupt handler and eventually return to the breakpoint, to trigger it > > again. > > > > From the user point of view it looks like the CPU never executed a > > single instruction and in some cases that can even prevent forward progress, > > for example, when the breakpoint is placed by an automated script > > (e.g lx-symbols), which does something in response to the breakpoint and then > > continues the guest automatically. > > If the script execution takes enough time for another interrupt to arrive, > > the guest will be stuck on the same breakpoint RIP forever. > > > > * Normal single stepping is much more predictable, since it won't land the > > debugger into an interrupt handler, so it is much more usable. > > > > (If entry to an interrupt handler is desired, the user can still place a > > breakpoint at it and resume the guest, which won't activate this workaround > > and let the gdb still stop at the interrupt handler) > > > > Since this change is only active when guest is debugged, it won't affect > > KVM running normal 'production' VMs. > > > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> > > Tested-by: Stefano Garzarella <sgarzare@redhat.com> > > --- > > arch/x86/kvm/x86.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index a9d95f90a0487..b75d990fcf12b 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit > > can_inject = false; > > } > > > > + /* > > + * Don't inject interrupts while single stepping to make guest debug easier > > + */ > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) > > + return; > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI > blocking at the start of single-stepping, unwind at the end? Deviating this far > from architectural behavior will end in tears at some point. I don't worry about NMI, but for IRQs, userspace can clear EFLAGS.IF, but that can be messy to unwind, if an instruction that clears the interrupt flag was single stepped over. There is also notion of interrupt shadow but it also is reserved for things like delaying interrupts for one cycle after sti, and such. IMHO KVM_GUESTDBG_SINGLESTEP is already non architectural feature (userspace basically tell the KVM to single step the guest but it doesn't set TF flag or something like that), so changing its definition shouldn't be a problem. If you worry about some automated script breaking due to the change, (I expect that KVM_GUESTDBG_SINGLESTEP is mostly used manually, especially since single stepping is never 100% reliable due to various issues like that), I can add another flag to it which will block all the interrupts. (like say KVM_GUESTDBG_BLOCKEVENTS). In fact qemu already has single step flags, enabled over special qemu gdb extension 'maintenance packet qqemu.sstepbits' Those single step flags allow to disable interrupts and qemu timers during the single stepping, (and both modes are enabled by default) However kvm code in qemu ignores these bits. What do you think? Best regards, Maxim Levitsky > > > + > > /* > > * Finally, inject interrupt events. If an event cannot be injected > > * due to architectural conditions (e.g. IF=0) a window-open exit > > -- > > 2.26.2 > >
On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: > On 16.03.21 00:37, Sean Christopherson wrote: > > On Tue, Mar 16, 2021, Maxim Levitsky wrote: > > > This change greatly helps with two issues: > > > > > > * Resuming from a breakpoint is much more reliable. > > > > > > When resuming execution from a breakpoint, with interrupts enabled, more often > > > than not, KVM would inject an interrupt and make the CPU jump immediately to > > > the interrupt handler and eventually return to the breakpoint, to trigger it > > > again. > > > > > > From the user point of view it looks like the CPU never executed a > > > single instruction and in some cases that can even prevent forward progress, > > > for example, when the breakpoint is placed by an automated script > > > (e.g lx-symbols), which does something in response to the breakpoint and then > > > continues the guest automatically. > > > If the script execution takes enough time for another interrupt to arrive, > > > the guest will be stuck on the same breakpoint RIP forever. > > > > > > * Normal single stepping is much more predictable, since it won't land the > > > debugger into an interrupt handler, so it is much more usable. > > > > > > (If entry to an interrupt handler is desired, the user can still place a > > > breakpoint at it and resume the guest, which won't activate this workaround > > > and let the gdb still stop at the interrupt handler) > > > > > > Since this change is only active when guest is debugged, it won't affect > > > KVM running normal 'production' VMs. > > > > > > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> > > > Tested-by: Stefano Garzarella <sgarzare@redhat.com> > > > --- > > > arch/x86/kvm/x86.c | 6 ++++++ > > > 1 file changed, 6 insertions(+) > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index a9d95f90a0487..b75d990fcf12b 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit > > > can_inject = false; > > > } > > > > > > + /* > > > + * Don't inject interrupts while single stepping to make guest debug easier > > > + */ > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) > > > + return; > > > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI > > blocking at the start of single-stepping, unwind at the end? Deviating this far > > from architectural behavior will end in tears at some point. > > > > Does this happen to address this suspicious workaround in the kernel? > > /* > * The kernel doesn't use TF single-step outside of: > * > * - Kprobes, consumed through kprobe_debug_handler() > * - KGDB, consumed through notify_debug() > * > * So if we get here with DR_STEP set, something is wonky. > * > * A known way to trigger this is through QEMU's GDB stub, > * which leaks #DB into the guest and causes IST recursion. > */ > if (WARN_ON_ONCE(dr6 & DR_STEP)) > regs->flags &= ~X86_EFLAGS_TF; > > (arch/x86/kernel/traps.c, exc_debug_kernel) > > I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, > yeah, question to myself as well, dancing around broken guest debugging > for a long time while trying to fix other issues... To be honest I didn't see that warning even once, but I can imagine KVM leaking #DB due to bugs in that code. That area historically didn't receive much attention since it can only be triggered by KVM_GET/SET_GUEST_DEBUG which isn't used in production. The only issue that I on the other hand did see which is mostly gdb fault is that it fails to remove a software breakpoint when resuming over it, if that breakpoint's python handler messes up with gdb's symbols, which is what lx-symbols does. And that despite the fact that lx-symbol doesn't mess with the object (that is the kernel) where the breakpoint is defined. Just adding/removing one symbol file is enough to trigger this issue. Since lx-symbols already works this around when it reloads all symbols, I extended that workaround to happen also when loading/unloading only a single symbol file. Best regards, Maxim Levitsky > > Jan > > > > + > > > /* > > > * Finally, inject interrupt events. If an event cannot be injected > > > * due to architectural conditions (e.g. IF=0) a window-open exit > > > -- > > > 2.26.2 > > >
On 16.03.21 11:59, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: >> On 16.03.21 00:37, Sean Christopherson wrote: >>> On Tue, Mar 16, 2021, Maxim Levitsky wrote: >>>> This change greatly helps with two issues: >>>> >>>> * Resuming from a breakpoint is much more reliable. >>>> >>>> When resuming execution from a breakpoint, with interrupts enabled, more often >>>> than not, KVM would inject an interrupt and make the CPU jump immediately to >>>> the interrupt handler and eventually return to the breakpoint, to trigger it >>>> again. >>>> >>>> From the user point of view it looks like the CPU never executed a >>>> single instruction and in some cases that can even prevent forward progress, >>>> for example, when the breakpoint is placed by an automated script >>>> (e.g lx-symbols), which does something in response to the breakpoint and then >>>> continues the guest automatically. >>>> If the script execution takes enough time for another interrupt to arrive, >>>> the guest will be stuck on the same breakpoint RIP forever. >>>> >>>> * Normal single stepping is much more predictable, since it won't land the >>>> debugger into an interrupt handler, so it is much more usable. >>>> >>>> (If entry to an interrupt handler is desired, the user can still place a >>>> breakpoint at it and resume the guest, which won't activate this workaround >>>> and let the gdb still stop at the interrupt handler) >>>> >>>> Since this change is only active when guest is debugged, it won't affect >>>> KVM running normal 'production' VMs. >>>> >>>> >>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> >>>> Tested-by: Stefano Garzarella <sgarzare@redhat.com> >>>> --- >>>> arch/x86/kvm/x86.c | 6 ++++++ >>>> 1 file changed, 6 insertions(+) >>>> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>> index a9d95f90a0487..b75d990fcf12b 100644 >>>> --- a/arch/x86/kvm/x86.c >>>> +++ b/arch/x86/kvm/x86.c >>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit >>>> can_inject = false; >>>> } >>>> >>>> + /* >>>> + * Don't inject interrupts while single stepping to make guest debug easier >>>> + */ >>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) >>>> + return; >>> >>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI >>> blocking at the start of single-stepping, unwind at the end? Deviating this far >>> from architectural behavior will end in tears at some point. >>> >> >> Does this happen to address this suspicious workaround in the kernel? >> >> /* >> * The kernel doesn't use TF single-step outside of: >> * >> * - Kprobes, consumed through kprobe_debug_handler() >> * - KGDB, consumed through notify_debug() >> * >> * So if we get here with DR_STEP set, something is wonky. >> * >> * A known way to trigger this is through QEMU's GDB stub, >> * which leaks #DB into the guest and causes IST recursion. >> */ >> if (WARN_ON_ONCE(dr6 & DR_STEP)) >> regs->flags &= ~X86_EFLAGS_TF; >> >> (arch/x86/kernel/traps.c, exc_debug_kernel) >> >> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, >> yeah, question to myself as well, dancing around broken guest debugging >> for a long time while trying to fix other issues... > > To be honest I didn't see that warning even once, but I can imagine KVM > leaking #DB due to bugs in that code. That area historically didn't receive > much attention since it can only be triggered by > KVM_GET/SET_GUEST_DEBUG which isn't used in production. I've triggered it recently while debugging a guest, that's why I got aware of the code path. Long ago, all this used to work (soft BPs, single-stepping etc.) > > The only issue that I on the other hand did > see which is mostly gdb fault is that it fails to remove a software breakpoint > when resuming over it, if that breakpoint's python handler messes up > with gdb's symbols, which is what lx-symbols does. > > And that despite the fact that lx-symbol doesn't mess with the object > (that is the kernel) where the breakpoint is defined. > > Just adding/removing one symbol file is enough to trigger this issue. > > Since lx-symbols already works this around when it reloads all symbols, > I extended that workaround to happen also when loading/unloading > only a single symbol file. You have no issue with interactive debugging when NOT using gdb scripts / lx-symbol? Jan
On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote: > On 16.03.21 11:59, Maxim Levitsky wrote: > > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: > > > On 16.03.21 00:37, Sean Christopherson wrote: > > > > On Tue, Mar 16, 2021, Maxim Levitsky wrote: > > > > > This change greatly helps with two issues: > > > > > > > > > > * Resuming from a breakpoint is much more reliable. > > > > > > > > > > When resuming execution from a breakpoint, with interrupts enabled, more often > > > > > than not, KVM would inject an interrupt and make the CPU jump immediately to > > > > > the interrupt handler and eventually return to the breakpoint, to trigger it > > > > > again. > > > > > > > > > > From the user point of view it looks like the CPU never executed a > > > > > single instruction and in some cases that can even prevent forward progress, > > > > > for example, when the breakpoint is placed by an automated script > > > > > (e.g lx-symbols), which does something in response to the breakpoint and then > > > > > continues the guest automatically. > > > > > If the script execution takes enough time for another interrupt to arrive, > > > > > the guest will be stuck on the same breakpoint RIP forever. > > > > > > > > > > * Normal single stepping is much more predictable, since it won't land the > > > > > debugger into an interrupt handler, so it is much more usable. > > > > > > > > > > (If entry to an interrupt handler is desired, the user can still place a > > > > > breakpoint at it and resume the guest, which won't activate this workaround > > > > > and let the gdb still stop at the interrupt handler) > > > > > > > > > > Since this change is only active when guest is debugged, it won't affect > > > > > KVM running normal 'production' VMs. > > > > > > > > > > > > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> > > > > > Tested-by: Stefano Garzarella <sgarzare@redhat.com> > > > > > --- > > > > > arch/x86/kvm/x86.c | 6 ++++++ > > > > > 1 file changed, 6 insertions(+) > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > index a9d95f90a0487..b75d990fcf12b 100644 > > > > > --- a/arch/x86/kvm/x86.c > > > > > +++ b/arch/x86/kvm/x86.c > > > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit > > > > > can_inject = false; > > > > > } > > > > > > > > > > + /* > > > > > + * Don't inject interrupts while single stepping to make guest debug easier > > > > > + */ > > > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) > > > > > + return; > > > > > > > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI > > > > blocking at the start of single-stepping, unwind at the end? Deviating this far > > > > from architectural behavior will end in tears at some point. > > > > > > > > > > Does this happen to address this suspicious workaround in the kernel? > > > > > > /* > > > * The kernel doesn't use TF single-step outside of: > > > * > > > * - Kprobes, consumed through kprobe_debug_handler() > > > * - KGDB, consumed through notify_debug() > > > * > > > * So if we get here with DR_STEP set, something is wonky. > > > * > > > * A known way to trigger this is through QEMU's GDB stub, > > > * which leaks #DB into the guest and causes IST recursion. > > > */ > > > if (WARN_ON_ONCE(dr6 & DR_STEP)) > > > regs->flags &= ~X86_EFLAGS_TF; > > > > > > (arch/x86/kernel/traps.c, exc_debug_kernel) > > > > > > I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, > > > yeah, question to myself as well, dancing around broken guest debugging > > > for a long time while trying to fix other issues... > > > > To be honest I didn't see that warning even once, but I can imagine KVM > > leaking #DB due to bugs in that code. That area historically didn't receive > > much attention since it can only be triggered by > > KVM_GET/SET_GUEST_DEBUG which isn't used in production. > > I've triggered it recently while debugging a guest, that's why I got > aware of the code path. Long ago, all this used to work (soft BPs, > single-stepping etc.) > > > The only issue that I on the other hand did > > see which is mostly gdb fault is that it fails to remove a software breakpoint > > when resuming over it, if that breakpoint's python handler messes up > > with gdb's symbols, which is what lx-symbols does. > > > > And that despite the fact that lx-symbol doesn't mess with the object > > (that is the kernel) where the breakpoint is defined. > > > > Just adding/removing one symbol file is enough to trigger this issue. > > > > Since lx-symbols already works this around when it reloads all symbols, > > I extended that workaround to happen also when loading/unloading > > only a single symbol file. > > You have no issue with interactive debugging when NOT using gdb scripts > / lx-symbol? To be honest I don't use guest debugging that much, so I probably missed some issues. Now that I fixed lx-symbols though I'll probably use guest debugging much more. I will keep an eye on any issues that I find. The main push to fix lx-symbols actually came from me wanting to understand if there is something broken with KVM's guest debugging knowing that lx-symbols crashes the guest when module is loaded after lx-symbols was executed. That lx-symbols related guest crash I traced to issue with gdb as I explained, and the lack of blocking of the interrupts on single step is not a bug but more a missing feature that should be implemented to make single step easier to use. Another issue which isn't a bug is that you can't place a software breakpoint if kernel is not loaded (since there is no code in memory) or if the kernel haven't done basic paging initialization (since there is no paging yet to know where to place the breakpoint). Hardware breakpoints work for this fine though. So in summary I haven't found any major issues with KVM's guest debug yet. If I do notice issues with guest debug, I will try to isolate and debug them. For the issue that you mentioned, do you have a way to reproduce it? Best regards, Maxim Levitsky > > Jan >
On 16.03.21 13:34, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote: >> On 16.03.21 11:59, Maxim Levitsky wrote: >>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: >>>> On 16.03.21 00:37, Sean Christopherson wrote: >>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote: >>>>>> This change greatly helps with two issues: >>>>>> >>>>>> * Resuming from a breakpoint is much more reliable. >>>>>> >>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often >>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to >>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it >>>>>> again. >>>>>> >>>>>> From the user point of view it looks like the CPU never executed a >>>>>> single instruction and in some cases that can even prevent forward progress, >>>>>> for example, when the breakpoint is placed by an automated script >>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then >>>>>> continues the guest automatically. >>>>>> If the script execution takes enough time for another interrupt to arrive, >>>>>> the guest will be stuck on the same breakpoint RIP forever. >>>>>> >>>>>> * Normal single stepping is much more predictable, since it won't land the >>>>>> debugger into an interrupt handler, so it is much more usable. >>>>>> >>>>>> (If entry to an interrupt handler is desired, the user can still place a >>>>>> breakpoint at it and resume the guest, which won't activate this workaround >>>>>> and let the gdb still stop at the interrupt handler) >>>>>> >>>>>> Since this change is only active when guest is debugged, it won't affect >>>>>> KVM running normal 'production' VMs. >>>>>> >>>>>> >>>>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> >>>>>> Tested-by: Stefano Garzarella <sgarzare@redhat.com> >>>>>> --- >>>>>> arch/x86/kvm/x86.c | 6 ++++++ >>>>>> 1 file changed, 6 insertions(+) >>>>>> >>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>>>> index a9d95f90a0487..b75d990fcf12b 100644 >>>>>> --- a/arch/x86/kvm/x86.c >>>>>> +++ b/arch/x86/kvm/x86.c >>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit >>>>>> can_inject = false; >>>>>> } >>>>>> >>>>>> + /* >>>>>> + * Don't inject interrupts while single stepping to make guest debug easier >>>>>> + */ >>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) >>>>>> + return; >>>>> >>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI >>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far >>>>> from architectural behavior will end in tears at some point. >>>>> >>>> >>>> Does this happen to address this suspicious workaround in the kernel? >>>> >>>> /* >>>> * The kernel doesn't use TF single-step outside of: >>>> * >>>> * - Kprobes, consumed through kprobe_debug_handler() >>>> * - KGDB, consumed through notify_debug() >>>> * >>>> * So if we get here with DR_STEP set, something is wonky. >>>> * >>>> * A known way to trigger this is through QEMU's GDB stub, >>>> * which leaks #DB into the guest and causes IST recursion. >>>> */ >>>> if (WARN_ON_ONCE(dr6 & DR_STEP)) >>>> regs->flags &= ~X86_EFLAGS_TF; >>>> >>>> (arch/x86/kernel/traps.c, exc_debug_kernel) >>>> >>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, >>>> yeah, question to myself as well, dancing around broken guest debugging >>>> for a long time while trying to fix other issues... >>> >>> To be honest I didn't see that warning even once, but I can imagine KVM >>> leaking #DB due to bugs in that code. That area historically didn't receive >>> much attention since it can only be triggered by >>> KVM_GET/SET_GUEST_DEBUG which isn't used in production. >> >> I've triggered it recently while debugging a guest, that's why I got >> aware of the code path. Long ago, all this used to work (soft BPs, >> single-stepping etc.) >> >>> The only issue that I on the other hand did >>> see which is mostly gdb fault is that it fails to remove a software breakpoint >>> when resuming over it, if that breakpoint's python handler messes up >>> with gdb's symbols, which is what lx-symbols does. >>> >>> And that despite the fact that lx-symbol doesn't mess with the object >>> (that is the kernel) where the breakpoint is defined. >>> >>> Just adding/removing one symbol file is enough to trigger this issue. >>> >>> Since lx-symbols already works this around when it reloads all symbols, >>> I extended that workaround to happen also when loading/unloading >>> only a single symbol file. >> >> You have no issue with interactive debugging when NOT using gdb scripts >> / lx-symbol? > > To be honest I don't use guest debugging that much, > so I probably missed some issues. > > Now that I fixed lx-symbols though I'll probably use > guest debugging much more. > I will keep an eye on any issues that I find. > > The main push to fix lx-symbols actually came > from me wanting to understand if there is something > broken with KVM's guest debugging knowing that > lx-symbols crashes the guest when module is loaded > after lx-symbols was executed. > > That lx-symbols related guest crash I traced to issue > with gdb as I explained, and the lack of blocking of the interrupts > on single step is not a bug but more a missing feature > that should be implemented to make single step easier to use. Again, this used to work fine. But maybe this patch can change the picture by avoid that the unavoidable short TF leakage into the guest escalates beyond the single instruction to step over. > > Another issue which isn't a bug is that you can't place a software > breakpoint if kernel is not loaded (since there is no code in memory) > or if the kernel haven't done basic paging initialization > (since there is no paging yet to know where to place the breakpoint). > Hardware breakpoints work for this fine though. > > So in summary I haven't found any major issues with KVM's guest debug > yet. > > If I do notice issues with guest debug, I will try to isolate > and debug them. > For the issue that you mentioned, do you have a way to reproduce it? I need to spend some time in factoring out a clean test setup, will come back to you. I'm always pushing this to the back - and then grumble when hitting it while debugging something urgent. Your patch is a nice reason to do this systematically now. Jan
On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote: > On 16.03.21 13:34, Maxim Levitsky wrote: > > On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote: > > > On 16.03.21 11:59, Maxim Levitsky wrote: > > > > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: > > > > > On 16.03.21 00:37, Sean Christopherson wrote: > > > > > > On Tue, Mar 16, 2021, Maxim Levitsky wrote: > > > > > > > This change greatly helps with two issues: > > > > > > > > > > > > > > * Resuming from a breakpoint is much more reliable. > > > > > > > > > > > > > > When resuming execution from a breakpoint, with interrupts enabled, more often > > > > > > > than not, KVM would inject an interrupt and make the CPU jump immediately to > > > > > > > the interrupt handler and eventually return to the breakpoint, to trigger it > > > > > > > again. > > > > > > > > > > > > > > From the user point of view it looks like the CPU never executed a > > > > > > > single instruction and in some cases that can even prevent forward progress, > > > > > > > for example, when the breakpoint is placed by an automated script > > > > > > > (e.g lx-symbols), which does something in response to the breakpoint and then > > > > > > > continues the guest automatically. > > > > > > > If the script execution takes enough time for another interrupt to arrive, > > > > > > > the guest will be stuck on the same breakpoint RIP forever. > > > > > > > > > > > > > > * Normal single stepping is much more predictable, since it won't land the > > > > > > > debugger into an interrupt handler, so it is much more usable. > > > > > > > > > > > > > > (If entry to an interrupt handler is desired, the user can still place a > > > > > > > breakpoint at it and resume the guest, which won't activate this workaround > > > > > > > and let the gdb still stop at the interrupt handler) > > > > > > > > > > > > > > Since this change is only active when guest is debugged, it won't affect > > > > > > > KVM running normal 'production' VMs. > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> > > > > > > > Tested-by: Stefano Garzarella <sgarzare@redhat.com> > > > > > > > --- > > > > > > > arch/x86/kvm/x86.c | 6 ++++++ > > > > > > > 1 file changed, 6 insertions(+) > > > > > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > index a9d95f90a0487..b75d990fcf12b 100644 > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit > > > > > > > can_inject = false; > > > > > > > } > > > > > > > > > > > > > > + /* > > > > > > > + * Don't inject interrupts while single stepping to make guest debug easier > > > > > > > + */ > > > > > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) > > > > > > > + return; > > > > > > > > > > > > Is this something userspace can deal with? E.g. disable IRQs and/or set NMI > > > > > > blocking at the start of single-stepping, unwind at the end? Deviating this far > > > > > > from architectural behavior will end in tears at some point. > > > > > > > > > > > > > > > > Does this happen to address this suspicious workaround in the kernel? > > > > > > > > > > /* > > > > > * The kernel doesn't use TF single-step outside of: > > > > > * > > > > > * - Kprobes, consumed through kprobe_debug_handler() > > > > > * - KGDB, consumed through notify_debug() > > > > > * > > > > > * So if we get here with DR_STEP set, something is wonky. > > > > > * > > > > > * A known way to trigger this is through QEMU's GDB stub, > > > > > * which leaks #DB into the guest and causes IST recursion. > > > > > */ > > > > > if (WARN_ON_ONCE(dr6 & DR_STEP)) > > > > > regs->flags &= ~X86_EFLAGS_TF; > > > > > > > > > > (arch/x86/kernel/traps.c, exc_debug_kernel) > > > > > > > > > > I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, > > > > > yeah, question to myself as well, dancing around broken guest debugging > > > > > for a long time while trying to fix other issues... > > > > > > > > To be honest I didn't see that warning even once, but I can imagine KVM > > > > leaking #DB due to bugs in that code. That area historically didn't receive > > > > much attention since it can only be triggered by > > > > KVM_GET/SET_GUEST_DEBUG which isn't used in production. > > > > > > I've triggered it recently while debugging a guest, that's why I got > > > aware of the code path. Long ago, all this used to work (soft BPs, > > > single-stepping etc.) > > > > > > > The only issue that I on the other hand did > > > > see which is mostly gdb fault is that it fails to remove a software breakpoint > > > > when resuming over it, if that breakpoint's python handler messes up > > > > with gdb's symbols, which is what lx-symbols does. > > > > > > > > And that despite the fact that lx-symbol doesn't mess with the object > > > > (that is the kernel) where the breakpoint is defined. > > > > > > > > Just adding/removing one symbol file is enough to trigger this issue. > > > > > > > > Since lx-symbols already works this around when it reloads all symbols, > > > > I extended that workaround to happen also when loading/unloading > > > > only a single symbol file. > > > > > > You have no issue with interactive debugging when NOT using gdb scripts > > > / lx-symbol? > > > > To be honest I don't use guest debugging that much, > > so I probably missed some issues. > > > > Now that I fixed lx-symbols though I'll probably use > > guest debugging much more. > > I will keep an eye on any issues that I find. > > > > The main push to fix lx-symbols actually came > > from me wanting to understand if there is something > > broken with KVM's guest debugging knowing that > > lx-symbols crashes the guest when module is loaded > > after lx-symbols was executed. > > > > That lx-symbols related guest crash I traced to issue > > with gdb as I explained, and the lack of blocking of the interrupts > > on single step is not a bug but more a missing feature > > that should be implemented to make single step easier to use. > > Again, this used to work fine. But maybe this patch can change the > picture by avoid that the unavoidable short TF leakage into the guest > escalates beyond the single instruction to step over. Actually now I think I understand what is going on. The TF flag isn't auto cleared as RF flag is, and if the instruction which is single stepped gets an interrupt it is pushed onto the interrupt stack. (then it is cleared for the duration of the interrupt handler) Since we use the TF flag for single stepping the guest, this indeed can cause it to be leaked. So this patch actually should mitigate this almost completely. Also now I understand why Intel has the 'monitor trap' feature, I think it is exactly for the cases when hypervisor wants to single step the guest without the fear of changing of the guest visible cpu state. KVM on VMX should probably switch to using monitor trap for single stepping. Best regards, Maxim Levitsky > > > Another issue which isn't a bug is that you can't place a software > > breakpoint if kernel is not loaded (since there is no code in memory) > > or if the kernel haven't done basic paging initialization > > (since there is no paging yet to know where to place the breakpoint). > > Hardware breakpoints work for this fine though. > > > > So in summary I haven't found any major issues with KVM's guest debug > > yet. > > > > If I do notice issues with guest debug, I will try to isolate > > and debug them. > > For the issue that you mentioned, do you have a way to reproduce it? > > I need to spend some time in factoring out a clean test setup, will come > back to you. I'm always pushing this to the back - and then grumble when > hitting it while debugging something urgent. Your patch is a nice reason > to do this systematically now. Thanks for the review, Best regards, Maxim Levitsky > > Jan >
On 16.03.21 15:34, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote: >> On 16.03.21 13:34, Maxim Levitsky wrote: >>> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote: >>>> On 16.03.21 11:59, Maxim Levitsky wrote: >>>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: >>>>>> On 16.03.21 00:37, Sean Christopherson wrote: >>>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote: >>>>>>>> This change greatly helps with two issues: >>>>>>>> >>>>>>>> * Resuming from a breakpoint is much more reliable. >>>>>>>> >>>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often >>>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to >>>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it >>>>>>>> again. >>>>>>>> >>>>>>>> From the user point of view it looks like the CPU never executed a >>>>>>>> single instruction and in some cases that can even prevent forward progress, >>>>>>>> for example, when the breakpoint is placed by an automated script >>>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then >>>>>>>> continues the guest automatically. >>>>>>>> If the script execution takes enough time for another interrupt to arrive, >>>>>>>> the guest will be stuck on the same breakpoint RIP forever. >>>>>>>> >>>>>>>> * Normal single stepping is much more predictable, since it won't land the >>>>>>>> debugger into an interrupt handler, so it is much more usable. >>>>>>>> >>>>>>>> (If entry to an interrupt handler is desired, the user can still place a >>>>>>>> breakpoint at it and resume the guest, which won't activate this workaround >>>>>>>> and let the gdb still stop at the interrupt handler) >>>>>>>> >>>>>>>> Since this change is only active when guest is debugged, it won't affect >>>>>>>> KVM running normal 'production' VMs. >>>>>>>> >>>>>>>> >>>>>>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> >>>>>>>> Tested-by: Stefano Garzarella <sgarzare@redhat.com> >>>>>>>> --- >>>>>>>> arch/x86/kvm/x86.c | 6 ++++++ >>>>>>>> 1 file changed, 6 insertions(+) >>>>>>>> >>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>>>>>> index a9d95f90a0487..b75d990fcf12b 100644 >>>>>>>> --- a/arch/x86/kvm/x86.c >>>>>>>> +++ b/arch/x86/kvm/x86.c >>>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit >>>>>>>> can_inject = false; >>>>>>>> } >>>>>>>> >>>>>>>> + /* >>>>>>>> + * Don't inject interrupts while single stepping to make guest debug easier >>>>>>>> + */ >>>>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) >>>>>>>> + return; >>>>>>> >>>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI >>>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far >>>>>>> from architectural behavior will end in tears at some point. >>>>>>> >>>>>> >>>>>> Does this happen to address this suspicious workaround in the kernel? >>>>>> >>>>>> /* >>>>>> * The kernel doesn't use TF single-step outside of: >>>>>> * >>>>>> * - Kprobes, consumed through kprobe_debug_handler() >>>>>> * - KGDB, consumed through notify_debug() >>>>>> * >>>>>> * So if we get here with DR_STEP set, something is wonky. >>>>>> * >>>>>> * A known way to trigger this is through QEMU's GDB stub, >>>>>> * which leaks #DB into the guest and causes IST recursion. >>>>>> */ >>>>>> if (WARN_ON_ONCE(dr6 & DR_STEP)) >>>>>> regs->flags &= ~X86_EFLAGS_TF; >>>>>> >>>>>> (arch/x86/kernel/traps.c, exc_debug_kernel) >>>>>> >>>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, >>>>>> yeah, question to myself as well, dancing around broken guest debugging >>>>>> for a long time while trying to fix other issues... >>>>> >>>>> To be honest I didn't see that warning even once, but I can imagine KVM >>>>> leaking #DB due to bugs in that code. That area historically didn't receive >>>>> much attention since it can only be triggered by >>>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production. >>>> >>>> I've triggered it recently while debugging a guest, that's why I got >>>> aware of the code path. Long ago, all this used to work (soft BPs, >>>> single-stepping etc.) >>>> >>>>> The only issue that I on the other hand did >>>>> see which is mostly gdb fault is that it fails to remove a software breakpoint >>>>> when resuming over it, if that breakpoint's python handler messes up >>>>> with gdb's symbols, which is what lx-symbols does. >>>>> >>>>> And that despite the fact that lx-symbol doesn't mess with the object >>>>> (that is the kernel) where the breakpoint is defined. >>>>> >>>>> Just adding/removing one symbol file is enough to trigger this issue. >>>>> >>>>> Since lx-symbols already works this around when it reloads all symbols, >>>>> I extended that workaround to happen also when loading/unloading >>>>> only a single symbol file. >>>> >>>> You have no issue with interactive debugging when NOT using gdb scripts >>>> / lx-symbol? >>> >>> To be honest I don't use guest debugging that much, >>> so I probably missed some issues. >>> >>> Now that I fixed lx-symbols though I'll probably use >>> guest debugging much more. >>> I will keep an eye on any issues that I find. >>> >>> The main push to fix lx-symbols actually came >>> from me wanting to understand if there is something >>> broken with KVM's guest debugging knowing that >>> lx-symbols crashes the guest when module is loaded >>> after lx-symbols was executed. >>> >>> That lx-symbols related guest crash I traced to issue >>> with gdb as I explained, and the lack of blocking of the interrupts >>> on single step is not a bug but more a missing feature >>> that should be implemented to make single step easier to use. >> >> Again, this used to work fine. But maybe this patch can change the >> picture by avoid that the unavoidable short TF leakage into the guest >> escalates beyond the single instruction to step over. > > > Actually now I think I understand what is going on. > > The TF flag isn't auto cleared as RF flag is, and if the instruction > which is single stepped gets an interrupt it is pushed onto the interrupt stack. > (then it is cleared for the duration of the interrupt handler) > Since we use the TF flag for single stepping the guest, this indeed can > cause it to be leaked. > > So this patch actually should mitigate this almost completely. > > Also now I understand why Intel has the 'monitor trap' feature, I think it > is exactly for the cases when hypervisor wants to single step the guest > without the fear of changing of the guest visible cpu state. Exactly. > > KVM on VMX should probably switch to using monitor trap for single stepping. Back then, when I was hacking on the gdb-stub and KVM support, the monitor trap flag was not yet broadly available, but the idea to once use it was already there. Now it can be considered broadly available, but it would still require some changes to get it in. Unfortunately, we don't have such thing with SVM, even recent versions, right? So, a proper way of avoiding diverting event injections while we are having the guest in an "incorrect" state should definitely be the goal. Given that KVM knows whether TF originates solely from guest debugging or was (also) injected by the guest, we should be able to identify the cases where your approach is best to apply. And that without any extra control knob that everyone will only forget to set. Jan
On 16.03.21 16:49, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote: >> On 16.03.21 15:34, Maxim Levitsky wrote: >>> On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote: >>>> On 16.03.21 13:34, Maxim Levitsky wrote: >>>>> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote: >>>>>> On 16.03.21 11:59, Maxim Levitsky wrote: >>>>>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: >>>>>>>> On 16.03.21 00:37, Sean Christopherson wrote: >>>>>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote: >>>>>>>>>> This change greatly helps with two issues: >>>>>>>>>> >>>>>>>>>> * Resuming from a breakpoint is much more reliable. >>>>>>>>>> >>>>>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often >>>>>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to >>>>>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it >>>>>>>>>> again. >>>>>>>>>> >>>>>>>>>> From the user point of view it looks like the CPU never executed a >>>>>>>>>> single instruction and in some cases that can even prevent forward progress, >>>>>>>>>> for example, when the breakpoint is placed by an automated script >>>>>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then >>>>>>>>>> continues the guest automatically. >>>>>>>>>> If the script execution takes enough time for another interrupt to arrive, >>>>>>>>>> the guest will be stuck on the same breakpoint RIP forever. >>>>>>>>>> >>>>>>>>>> * Normal single stepping is much more predictable, since it won't land the >>>>>>>>>> debugger into an interrupt handler, so it is much more usable. >>>>>>>>>> >>>>>>>>>> (If entry to an interrupt handler is desired, the user can still place a >>>>>>>>>> breakpoint at it and resume the guest, which won't activate this workaround >>>>>>>>>> and let the gdb still stop at the interrupt handler) >>>>>>>>>> >>>>>>>>>> Since this change is only active when guest is debugged, it won't affect >>>>>>>>>> KVM running normal 'production' VMs. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> >>>>>>>>>> Tested-by: Stefano Garzarella <sgarzare@redhat.com> >>>>>>>>>> --- >>>>>>>>>> arch/x86/kvm/x86.c | 6 ++++++ >>>>>>>>>> 1 file changed, 6 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>>>>>>>> index a9d95f90a0487..b75d990fcf12b 100644 >>>>>>>>>> --- a/arch/x86/kvm/x86.c >>>>>>>>>> +++ b/arch/x86/kvm/x86.c >>>>>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit >>>>>>>>>> can_inject = false; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> + /* >>>>>>>>>> + * Don't inject interrupts while single stepping to make guest debug easier >>>>>>>>>> + */ >>>>>>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) >>>>>>>>>> + return; >>>>>>>>> >>>>>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI >>>>>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far >>>>>>>>> from architectural behavior will end in tears at some point. >>>>>>>>> >>>>>>>> >>>>>>>> Does this happen to address this suspicious workaround in the kernel? >>>>>>>> >>>>>>>> /* >>>>>>>> * The kernel doesn't use TF single-step outside of: >>>>>>>> * >>>>>>>> * - Kprobes, consumed through kprobe_debug_handler() >>>>>>>> * - KGDB, consumed through notify_debug() >>>>>>>> * >>>>>>>> * So if we get here with DR_STEP set, something is wonky. >>>>>>>> * >>>>>>>> * A known way to trigger this is through QEMU's GDB stub, >>>>>>>> * which leaks #DB into the guest and causes IST recursion. >>>>>>>> */ >>>>>>>> if (WARN_ON_ONCE(dr6 & DR_STEP)) >>>>>>>> regs->flags &= ~X86_EFLAGS_TF; >>>>>>>> >>>>>>>> (arch/x86/kernel/traps.c, exc_debug_kernel) >>>>>>>> >>>>>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, >>>>>>>> yeah, question to myself as well, dancing around broken guest debugging >>>>>>>> for a long time while trying to fix other issues... >>>>>>> >>>>>>> To be honest I didn't see that warning even once, but I can imagine KVM >>>>>>> leaking #DB due to bugs in that code. That area historically didn't receive >>>>>>> much attention since it can only be triggered by >>>>>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production. >>>>>> >>>>>> I've triggered it recently while debugging a guest, that's why I got >>>>>> aware of the code path. Long ago, all this used to work (soft BPs, >>>>>> single-stepping etc.) >>>>>> >>>>>>> The only issue that I on the other hand did >>>>>>> see which is mostly gdb fault is that it fails to remove a software breakpoint >>>>>>> when resuming over it, if that breakpoint's python handler messes up >>>>>>> with gdb's symbols, which is what lx-symbols does. >>>>>>> >>>>>>> And that despite the fact that lx-symbol doesn't mess with the object >>>>>>> (that is the kernel) where the breakpoint is defined. >>>>>>> >>>>>>> Just adding/removing one symbol file is enough to trigger this issue. >>>>>>> >>>>>>> Since lx-symbols already works this around when it reloads all symbols, >>>>>>> I extended that workaround to happen also when loading/unloading >>>>>>> only a single symbol file. >>>>>> >>>>>> You have no issue with interactive debugging when NOT using gdb scripts >>>>>> / lx-symbol? >>>>> >>>>> To be honest I don't use guest debugging that much, >>>>> so I probably missed some issues. >>>>> >>>>> Now that I fixed lx-symbols though I'll probably use >>>>> guest debugging much more. >>>>> I will keep an eye on any issues that I find. >>>>> >>>>> The main push to fix lx-symbols actually came >>>>> from me wanting to understand if there is something >>>>> broken with KVM's guest debugging knowing that >>>>> lx-symbols crashes the guest when module is loaded >>>>> after lx-symbols was executed. >>>>> >>>>> That lx-symbols related guest crash I traced to issue >>>>> with gdb as I explained, and the lack of blocking of the interrupts >>>>> on single step is not a bug but more a missing feature >>>>> that should be implemented to make single step easier to use. >>>> >>>> Again, this used to work fine. But maybe this patch can change the >>>> picture by avoid that the unavoidable short TF leakage into the guest >>>> escalates beyond the single instruction to step over. >>> >>> >>> Actually now I think I understand what is going on. >>> >>> The TF flag isn't auto cleared as RF flag is, and if the instruction >>> which is single stepped gets an interrupt it is pushed onto the interrupt stack. >>> (then it is cleared for the duration of the interrupt handler) >>> Since we use the TF flag for single stepping the guest, this indeed can >>> cause it to be leaked. >>> >>> So this patch actually should mitigate this almost completely. >>> >>> Also now I understand why Intel has the 'monitor trap' feature, I think it >>> is exactly for the cases when hypervisor wants to single step the guest >>> without the fear of changing of the guest visible cpu state. >> >> Exactly. >> >>> >>> KVM on VMX should probably switch to using monitor trap for single stepping. >> >> Back then, when I was hacking on the gdb-stub and KVM support, the >> monitor trap flag was not yet broadly available, but the idea to once >> use it was already there. Now it can be considered broadly available, >> but it would still require some changes to get it in. >> >> Unfortunately, we don't have such thing with SVM, even recent versions, >> right? So, a proper way of avoiding diverting event injections while we >> are having the guest in an "incorrect" state should definitely be the goal. > Yes, I am not aware of anything like monitor trap on SVM. > >> >> Given that KVM knows whether TF originates solely from guest debugging >> or was (also) injected by the guest, we should be able to identify the >> cases where your approach is best to apply. And that without any extra >> control knob that everyone will only forget to set. > Well I think that the downside of this patch is that the user might actually > want to single step into an interrupt handler, > and this patch makes it a bit more complicated, and changes the default > behavior. If the default makes debugging practically impossible and breaks the guest by leaking host state, that is also not OK. We must not leak, that is priority one. So, even if we consider stepping into the interrupt a use case (surely not the default one, so having this opt-in only), we must ensure that TF will never end up on the guest stack saved for the interrupt handler. That is KVM's default responsibility. Jan > > I have no objections though to use this patch as is, or at least make this > the new default with a new flag to override this. > > Sean Christopherson, what do you think? > > Best regards, > Maxim Levitsky > >> >> Jan >> > >
On Tue, Mar 16, 2021, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote: > > Back then, when I was hacking on the gdb-stub and KVM support, the > > monitor trap flag was not yet broadly available, but the idea to once > > use it was already there. Now it can be considered broadly available, > > but it would still require some changes to get it in. > > > > Unfortunately, we don't have such thing with SVM, even recent versions, > > right? So, a proper way of avoiding diverting event injections while we > > are having the guest in an "incorrect" state should definitely be the goal. > Yes, I am not aware of anything like monitor trap on SVM. > > > > > Given that KVM knows whether TF originates solely from guest debugging > > or was (also) injected by the guest, we should be able to identify the > > cases where your approach is best to apply. And that without any extra > > control knob that everyone will only forget to set. > Well I think that the downside of this patch is that the user might actually > want to single step into an interrupt handler, and this patch makes it a bit > more complicated, and changes the default behavior. Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't prevent weirdness if the guest is running in L2, since IRQs for L1 will cause exits from L2 during nested_ops->check_events(). > I have no objections though to use this patch as is, or at least make this > the new default with a new flag to override this. That's less bad, but IMO still violates the principle of least surprise, e.g. someone that is single-stepping a guest and is expecting an IRQ to fire will be all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc... settings, but no interrupt. > Sean Christopherson, what do you think? Rather than block all events in KVM, what about having QEMU "pause" the timer? E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out which flavor it's using), clear them to zero, then restore both when single-stepping is disabled. I think that will work?
On 16.03.21 17:50, Sean Christopherson wrote: > On Tue, Mar 16, 2021, Maxim Levitsky wrote: >> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote: >>> Back then, when I was hacking on the gdb-stub and KVM support, the >>> monitor trap flag was not yet broadly available, but the idea to once >>> use it was already there. Now it can be considered broadly available, >>> but it would still require some changes to get it in. >>> >>> Unfortunately, we don't have such thing with SVM, even recent versions, >>> right? So, a proper way of avoiding diverting event injections while we >>> are having the guest in an "incorrect" state should definitely be the goal. >> Yes, I am not aware of anything like monitor trap on SVM. >> >>> >>> Given that KVM knows whether TF originates solely from guest debugging >>> or was (also) injected by the guest, we should be able to identify the >>> cases where your approach is best to apply. And that without any extra >>> control knob that everyone will only forget to set. >> Well I think that the downside of this patch is that the user might actually >> want to single step into an interrupt handler, and this patch makes it a bit >> more complicated, and changes the default behavior. > > Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't > prevent weirdness if the guest is running in L2, since IRQs for L1 will cause > exits from L2 during nested_ops->check_events(). > >> I have no objections though to use this patch as is, or at least make this >> the new default with a new flag to override this. > > That's less bad, but IMO still violates the principle of least surprise, e.g. > someone that is single-stepping a guest and is expecting an IRQ to fire will be > all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc... > settings, but no interrupt. From my practical experience with debugging guests via single step, seeing an interrupt in that case is everything but handy and generally also not expected (though logical, I agree). IOW: When there is a knob for it, it will remain off in 99% of the time. But I see the point of having some control, in an ideal world also an indication that there are pending events, permitting the user to decide what to do. But I suspect the gdb frontend and protocol does not easily permit that. > >> Sean Christopherson, what do you think? > > Rather than block all events in KVM, what about having QEMU "pause" the timer? > E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out > which flavor it's using), clear them to zero, then restore both when > single-stepping is disabled. I think that will work? > No one can stop the clock, and timers are only one source of interrupts. Plus they do not all come from QEMU, some also from KVM or in-kernel sources directly. Would quickly become a mess. Jan
On Tue, Mar 16, 2021, Jan Kiszka wrote: > On 16.03.21 17:50, Sean Christopherson wrote: > > Rather than block all events in KVM, what about having QEMU "pause" the timer? > > E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out > > which flavor it's using), clear them to zero, then restore both when > > single-stepping is disabled. I think that will work? > > > > No one can stop the clock, and timers are only one source of interrupts. > Plus they do not all come from QEMU, some also from KVM or in-kernel > sources directly. But are any other sources of interrupts a chronic problem? I 100% agree that this would not be a robust solution, but neither is blocking events in KVM. At least with this approach, the blast radius is somewhat contained. > Would quickly become a mess. Maybe, but it'd be Qemu's mess ;-)
On Tue, 2021-03-16 at 18:01 +0100, Jan Kiszka wrote: > On 16.03.21 17:50, Sean Christopherson wrote: > > On Tue, Mar 16, 2021, Maxim Levitsky wrote: > > > On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote: > > > > Back then, when I was hacking on the gdb-stub and KVM support, the > > > > monitor trap flag was not yet broadly available, but the idea to once > > > > use it was already there. Now it can be considered broadly available, > > > > but it would still require some changes to get it in. > > > > > > > > Unfortunately, we don't have such thing with SVM, even recent versions, > > > > right? So, a proper way of avoiding diverting event injections while we > > > > are having the guest in an "incorrect" state should definitely be the goal. > > > Yes, I am not aware of anything like monitor trap on SVM. > > > > > > > Given that KVM knows whether TF originates solely from guest debugging > > > > or was (also) injected by the guest, we should be able to identify the > > > > cases where your approach is best to apply. And that without any extra > > > > control knob that everyone will only forget to set. > > > Well I think that the downside of this patch is that the user might actually > > > want to single step into an interrupt handler, and this patch makes it a bit > > > more complicated, and changes the default behavior. > > > > Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't > > prevent weirdness if the guest is running in L2, since IRQs for L1 will cause > > exits from L2 during nested_ops->check_events(). > > > > > I have no objections though to use this patch as is, or at least make this > > > the new default with a new flag to override this. > > > > That's less bad, but IMO still violates the principle of least surprise, e.g. > > someone that is single-stepping a guest and is expecting an IRQ to fire will be > > all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc... > > settings, but no interrupt. > > From my practical experience with debugging guests via single step, > seeing an interrupt in that case is everything but handy and generally > also not expected (though logical, I agree). IOW: When there is a knob > for it, it will remain off in 99% of the time. > > But I see the point of having some control, in an ideal world also an > indication that there are pending events, permitting the user to decide > what to do. But I suspect the gdb frontend and protocol does not easily > permit that. Qemu gdbstub actually does have control over suppression of the interrupts over a single step and it is even enabled by default: https://qemu.readthedocs.io/en/latest/system/gdb.html (advanced debug options) However it is currently only implemented in TCG (software emulator) mode and not in KVM mode (I can argue that this is a qemu bug). So my plan was to add a new kvm guest debug flag KVM_GUESTDBG_BLOCKEVENTS, and let qemu enable it when its 'NOIRQ' mode is enabled (it is by default). However due to the discussion in this thread about the leakage of the RFLAGS.TF, I wonder if kvm should by default suppress events and have something like KVM_GUESTDBG_SSTEP_ALLOW_EVENTS to override this and wire that to qemu's NOIRQ=false case. This will allow older qemu to work correctly and new qemu will be able to choose the old less ideal behavior. > > > > Sean Christopherson, what do you think? > > > > Rather than block all events in KVM, what about having QEMU "pause" the timer? > > E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out > > which flavor it's using), clear them to zero, then restore both when > > single-stepping is disabled. I think that will work? > > > > No one can stop the clock, and timers are only one source of interrupts. > Plus they do not all come from QEMU, some also from KVM or in-kernel > sources directly. Would quickly become a mess. This, plus as we see, even changing with RFLAGS.TF leaks it. Changing things like MSR_TSC_DEADLINE will also make it visible to the guest, sooner or later and is a mess that I rather not get into. It is _possible_ to disable timer interrupts 'out of band', but that is messy too if done from userspace. For example, what if the timer interrupt is already pending in local apic, when qemu decides to single step? Also with gdbstub the user doesn't have to stop all vcpus (there is a non-stop mode), in which only some vcpus are stopped which is actually a very cool feature, and of course running vcpus can raise events. Also interrupts can indeed come from things like vhost. Best regards, Maxim Levitsky > Jan >
On 16.03.21 18:26, Sean Christopherson wrote: > On Tue, Mar 16, 2021, Jan Kiszka wrote: >> On 16.03.21 17:50, Sean Christopherson wrote: >>> Rather than block all events in KVM, what about having QEMU "pause" the timer? >>> E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out >>> which flavor it's using), clear them to zero, then restore both when >>> single-stepping is disabled. I think that will work? >>> >> >> No one can stop the clock, and timers are only one source of interrupts. >> Plus they do not all come from QEMU, some also from KVM or in-kernel >> sources directly. > > But are any other sources of interrupts a chronic problem? I 100% agree that If you are debugging a problem, you are not interested in seening problems of the debugger, only real ones of your target. IOW: Yes, they are, even if less likely - for idle VMs. > this would not be a robust solution, but neither is blocking events in KVM. At > least with this approach, the blast radius is somewhat contained. > >> Would quickly become a mess. > > Maybe, but it'd be Qemu's mess ;-) > Nope, it would spread to KVM as well, as indicated above. Jan
On 16.03.21 13:34, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote: >> On 16.03.21 11:59, Maxim Levitsky wrote: >>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote: >>>> On 16.03.21 00:37, Sean Christopherson wrote: >>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote: >>>>>> This change greatly helps with two issues: >>>>>> >>>>>> * Resuming from a breakpoint is much more reliable. >>>>>> >>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often >>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to >>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it >>>>>> again. >>>>>> >>>>>> From the user point of view it looks like the CPU never executed a >>>>>> single instruction and in some cases that can even prevent forward progress, >>>>>> for example, when the breakpoint is placed by an automated script >>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then >>>>>> continues the guest automatically. >>>>>> If the script execution takes enough time for another interrupt to arrive, >>>>>> the guest will be stuck on the same breakpoint RIP forever. >>>>>> >>>>>> * Normal single stepping is much more predictable, since it won't land the >>>>>> debugger into an interrupt handler, so it is much more usable. >>>>>> >>>>>> (If entry to an interrupt handler is desired, the user can still place a >>>>>> breakpoint at it and resume the guest, which won't activate this workaround >>>>>> and let the gdb still stop at the interrupt handler) >>>>>> >>>>>> Since this change is only active when guest is debugged, it won't affect >>>>>> KVM running normal 'production' VMs. >>>>>> >>>>>> >>>>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> >>>>>> Tested-by: Stefano Garzarella <sgarzare@redhat.com> >>>>>> --- >>>>>> arch/x86/kvm/x86.c | 6 ++++++ >>>>>> 1 file changed, 6 insertions(+) >>>>>> >>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>>>> index a9d95f90a0487..b75d990fcf12b 100644 >>>>>> --- a/arch/x86/kvm/x86.c >>>>>> +++ b/arch/x86/kvm/x86.c >>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit >>>>>> can_inject = false; >>>>>> } >>>>>> >>>>>> + /* >>>>>> + * Don't inject interrupts while single stepping to make guest debug easier >>>>>> + */ >>>>>> + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) >>>>>> + return; >>>>> >>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI >>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far >>>>> from architectural behavior will end in tears at some point. >>>>> >>>> >>>> Does this happen to address this suspicious workaround in the kernel? >>>> >>>> /* >>>> * The kernel doesn't use TF single-step outside of: >>>> * >>>> * - Kprobes, consumed through kprobe_debug_handler() >>>> * - KGDB, consumed through notify_debug() >>>> * >>>> * So if we get here with DR_STEP set, something is wonky. >>>> * >>>> * A known way to trigger this is through QEMU's GDB stub, >>>> * which leaks #DB into the guest and causes IST recursion. >>>> */ >>>> if (WARN_ON_ONCE(dr6 & DR_STEP)) >>>> regs->flags &= ~X86_EFLAGS_TF; >>>> >>>> (arch/x86/kernel/traps.c, exc_debug_kernel) >>>> >>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh, >>>> yeah, question to myself as well, dancing around broken guest debugging >>>> for a long time while trying to fix other issues... >>> >>> To be honest I didn't see that warning even once, but I can imagine KVM >>> leaking #DB due to bugs in that code. That area historically didn't receive >>> much attention since it can only be triggered by >>> KVM_GET/SET_GUEST_DEBUG which isn't used in production. >> >> I've triggered it recently while debugging a guest, that's why I got >> aware of the code path. Long ago, all this used to work (soft BPs, >> single-stepping etc.) >> >>> The only issue that I on the other hand did >>> see which is mostly gdb fault is that it fails to remove a software breakpoint >>> when resuming over it, if that breakpoint's python handler messes up >>> with gdb's symbols, which is what lx-symbols does. >>> >>> And that despite the fact that lx-symbol doesn't mess with the object >>> (that is the kernel) where the breakpoint is defined. >>> >>> Just adding/removing one symbol file is enough to trigger this issue. >>> >>> Since lx-symbols already works this around when it reloads all symbols, >>> I extended that workaround to happen also when loading/unloading >>> only a single symbol file. >> >> You have no issue with interactive debugging when NOT using gdb scripts >> / lx-symbol? > > To be honest I don't use guest debugging that much, > so I probably missed some issues. > > Now that I fixed lx-symbols though I'll probably use > guest debugging much more. > I will keep an eye on any issues that I find. > > The main push to fix lx-symbols actually came > from me wanting to understand if there is something > broken with KVM's guest debugging knowing that > lx-symbols crashes the guest when module is loaded > after lx-symbols was executed. > > That lx-symbols related guest crash I traced to issue > with gdb as I explained, and the lack of blocking of the interrupts > on single step is not a bug but more a missing feature > that should be implemented to make single step easier to use. > > Another issue which isn't a bug is that you can't place a software > breakpoint if kernel is not loaded (since there is no code in memory) > or if the kernel haven't done basic paging initialization > (since there is no paging yet to know where to place the breakpoint). > Hardware breakpoints work for this fine though. > > So in summary I haven't found any major issues with KVM's guest debug > yet. > > If I do notice issues with guest debug, I will try to isolate > and debug them. > For the issue that you mentioned, do you have a way to reproduce it? To pick this up again, I did some experiments and was easily able to reproduce the main problem: - checkout linux master (1df27313f50 yesterday) - build a fitting host kernel with KVM - build a target kernel with defconfig + CONFIG_KVM_GUEST + CONFIG_DEBUG_INFO, no gdb scripts for now - boot the guest qemu-system-x86_64 linux.img -enable-kvm -smp 4 -s -kernel bzImage -append "console=ttyS0 root=... nokaslr" - gdb vmlinux - tar rem :1234 - b __x64_sys_execve - continue whenever you hit the breakpoint, and the guest will eventually hang - or stepi, and you may end up in the interrupt handler - specifically doing that on the serial console or setting the bp in early boot exposes the issue I've also seen WARNING: CPU: 3 PID: 751 at ../arch/x86/kernel/traps.c:915 exc_debug+0x16b/0x1c0 from time to time. When I apply this patch to the host, the problems are gone. Interestingly, my OpenSUSE 5.3.18-lp152.66-default does not show this behavior and /seems/ stable from quick testing. Not sure if there were changes in upstream between that baseline and head or if Suse is carrying a local fix. In any case, I think we need the following: - default disabling of event injections when guest debugging injected TF - possibly some control interface to allow events BUT then avoids any TF injection to prevent guest state corruptions - exposure of that interface to the gdb frontend, maybe by making it available via the QEMU monitor (monitor ...) - a for kvm-unit-tests to trigger the above corner cases Jan
[only saw this now, or delivery to me was delayed - anyway] On 16.03.21 19:02, Maxim Levitsky wrote: > On Tue, 2021-03-16 at 18:01 +0100, Jan Kiszka wrote: >> On 16.03.21 17:50, Sean Christopherson wrote: >>> On Tue, Mar 16, 2021, Maxim Levitsky wrote: >>>> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote: >>>>> Back then, when I was hacking on the gdb-stub and KVM support, the >>>>> monitor trap flag was not yet broadly available, but the idea to once >>>>> use it was already there. Now it can be considered broadly available, >>>>> but it would still require some changes to get it in. >>>>> >>>>> Unfortunately, we don't have such thing with SVM, even recent versions, >>>>> right? So, a proper way of avoiding diverting event injections while we >>>>> are having the guest in an "incorrect" state should definitely be the goal. >>>> Yes, I am not aware of anything like monitor trap on SVM. >>>> >>>>> Given that KVM knows whether TF originates solely from guest debugging >>>>> or was (also) injected by the guest, we should be able to identify the >>>>> cases where your approach is best to apply. And that without any extra >>>>> control knob that everyone will only forget to set. >>>> Well I think that the downside of this patch is that the user might actually >>>> want to single step into an interrupt handler, and this patch makes it a bit >>>> more complicated, and changes the default behavior. >>> >>> Yes. And, as is, this also blocks NMIs and SMIs. I suspect it also doesn't >>> prevent weirdness if the guest is running in L2, since IRQs for L1 will cause >>> exits from L2 during nested_ops->check_events(). >>> >>>> I have no objections though to use this patch as is, or at least make this >>>> the new default with a new flag to override this. >>> >>> That's less bad, but IMO still violates the principle of least surprise, e.g. >>> someone that is single-stepping a guest and is expecting an IRQ to fire will be >>> all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc... >>> settings, but no interrupt. >> >> From my practical experience with debugging guests via single step, >> seeing an interrupt in that case is everything but handy and generally >> also not expected (though logical, I agree). IOW: When there is a knob >> for it, it will remain off in 99% of the time. >> >> But I see the point of having some control, in an ideal world also an >> indication that there are pending events, permitting the user to decide >> what to do. But I suspect the gdb frontend and protocol does not easily >> permit that. > > Qemu gdbstub actually does have control over suppression of the interrupts > over a single step and it is even enabled by default: > > https://qemu.readthedocs.io/en/latest/system/gdb.html > (advanced debug options) > Ah, cool! Absolutely in line with what we need. > However it is currently only implemented in TCG (software emulator) mode > and not in KVM mode (I can argue that this is a qemu bug). Maybe the behavior of old KVM was not exposing the issue, thus no one cared. As I wrote in the other mail today, even some recent kernel do not seem to break single-stepping, for yet unknown reasons. > > So my plan was to add a new kvm guest debug flag KVM_GUESTDBG_BLOCKEVENTS, > and let qemu enable it when its 'NOIRQ' mode is enabled (it is by default). > > However due to the discussion in this thread about the leakage of the RFLAGS.TF, > I wonder if kvm should by default suppress events and have something like > KVM_GUESTDBG_SSTEP_ALLOW_EVENTS to override this and wire > that to qemu's NOIRQ=false case. > > This will allow older qemu to work correctly and new qemu will be able to choose > the old less ideal behavior. Sounds very reasonable to me. > >> >>>> Sean Christopherson, what do you think? >>> >>> Rather than block all events in KVM, what about having QEMU "pause" the timer? >>> E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out >>> which flavor it's using), clear them to zero, then restore both when >>> single-stepping is disabled. I think that will work? >>> >> >> No one can stop the clock, and timers are only one source of interrupts. >> Plus they do not all come from QEMU, some also from KVM or in-kernel >> sources directly. Would quickly become a mess. > > This, plus as we see, even changing with RFLAGS.TF leaks it. As I wrote: When we take events, the leakage must be stopped for that case. But that might be a bit more tricky because we need to stop on the first instruction in the interrupt handler, thus need some TF, but we must also remove it again from the flags saved for the interrupt context on the guest's interrupt/exception handler stack. > Changing things like MSR_TSC_DEADLINE will also make it visible to the guest, > sooner or later and is a mess that I rather not get into. > > It is _possible_ to disable timer interrupts 'out of band', but that is messy too > if done from userspace. For example, what if the timer interrupt is already pending > in local apic, when qemu decides to single step? > > Also with gdbstub the user doesn't have to stop all vcpus (there is a non-stop mode), > in which only some vcpus are stopped which is actually a very cool feature, > and of course running vcpus can raise events. > > Also interrupts can indeed come from things like vhost. > Exactly. Jan
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a9d95f90a0487..b75d990fcf12b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit can_inject = false; } + /* + * Don't inject interrupts while single stepping to make guest debug easier + */ + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) + return; + /* * Finally, inject interrupt events. If an event cannot be injected * due to architectural conditions (e.g. IF=0) a window-open exit