[4/4] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device

Message ID 20210507130923.528132061@redhat.com (mailing list archive)
State New, archived
Series VMX: configure posted interrupt descriptor when assigning device

Commit Message

Marcelo Tosatti May 7, 2021, 1:06 p.m. UTC
For VMX, when a vcpu enters HLT emulation, pi_pre_block will:

1) Add vcpu to per-cpu list of blocked vcpus.

2) Program the posted-interrupt descriptor "notification vector" 
to POSTED_INTR_WAKEUP_VECTOR

With interrupt remapping, an interrupt will set the PIR bit for the 
vector programmed for the device on the CPU, test-and-set the 
ON bit on the posted interrupt descriptor, and if the ON bit is clear
generate an interrupt for the notification vector.

This way, the target CPU wakes upon a device interrupt and wakes up
the target vcpu.
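
The delivery steps above can be sketched as a small userspace model. This is
only an illustration under stated assumptions: struct pi_desc_model and
pi_post_interrupt() are hypothetical names, the layout is simplified, and real
hardware performs the test-and-set atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical model of a posted-interrupt descriptor. */
struct pi_desc_model {
	uint32_t pir[8];	/* 256 posted-interrupt request bits */
	bool on;		/* outstanding notification (ON) bit */
	uint8_t nv;		/* notification vector */
};

/*
 * Post `vector` as the commit message describes: set its PIR bit,
 * test-and-set ON, and signal the notification vector only if ON was
 * previously clear. Returns the vector signalled to the pCPU, or -1
 * if no notification is generated. Not atomic: single-threaded model.
 */
int pi_post_interrupt(struct pi_desc_model *pi, uint8_t vector)
{
	bool was_on;

	pi->pir[vector / 32] |= UINT32_C(1) << (vector % 32);
	was_on = pi->on;
	pi->on = true;

	return was_on ? -1 : pi->nv;
}
```

In this model, which pCPU-side handler runs depends entirely on what nv holds
at the time the first interrupt is posted.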

The problem is that pi_pre_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. It's possible for the
following to happen:

1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit
(notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until
RCU watchdog NMIs all vCPUs.

To fix this, use the start_assignment kvm_x86_ops callback to kick
vcpus out of the halt loop, so the notification vector is 
properly reprogrammed to the wakeup vector.
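
As a rough illustration of the race and of the fix, the sequence can be
simulated in userspace. Everything here is a hypothetical model (the model_*
names, the vector constants, and the unblock-then-block shortcut that stands in
for a vcpu leaving and re-entering the halt loop); it is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical vector numbers, for illustration only. */
enum { MODEL_PI_VECTOR = 1, MODEL_PI_WAKEUP_VECTOR = 2 };

struct vm_model { int assigned_devices; };

struct vcpu_model {
	struct vm_model *vm;
	bool blocked;
	int nv;
};

/* Entering the halt loop: NV is only switched if a device is assigned. */
void model_block(struct vcpu_model *v)
{
	v->blocked = true;
	if (v->vm->assigned_devices > 0)
		v->nv = MODEL_PI_WAKEUP_VECTOR;
}

/* Leaving the halt loop restores the regular notification vector. */
void model_unblock(struct vcpu_model *v)
{
	v->blocked = false;
	v->nv = MODEL_PI_VECTOR;
}

/* A device interrupt wakes a blocked vcpu only via the wakeup vector. */
void model_device_interrupt(struct vcpu_model *v)
{
	if (v->blocked && v->nv == MODEL_PI_WAKEUP_VECTOR)
		model_unblock(v);
}

/* First device assignment; with `kick`, a blocked vcpu redoes blocking. */
void model_assign_device(struct vcpu_model *v, bool kick)
{
	v->vm->assigned_devices++;
	if (kick && v->vm->assigned_devices == 1 && v->blocked) {
		model_unblock(v);
		model_block(v);
	}
}

/* True if a vcpu that halted before assignment is woken by the device. */
bool model_run(bool kick)
{
	struct vm_model vm = { 0 };
	struct vcpu_model v = { &vm, false, MODEL_PI_VECTOR };

	model_block(&v);
	model_assign_device(&v, kick);
	model_device_interrupt(&v);
	return !v.blocked;
}
```

Calling model_run(false) reproduces steps 1) to 3) above and leaves the vcpu
blocked; model_run(true) corresponds to kicking vcpus from the
start_assignment callback.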

Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

v2: add vmx_pi_start_assignment to vmx's kvm_x86_ops

Comments

Sean Christopherson May 7, 2021, 5:22 p.m. UTC | #1
On Fri, May 07, 2021, Marcelo Tosatti wrote:
> Index: kvm/arch/x86/kvm/vmx/posted_intr.c
> ===================================================================
> --- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
> +++ kvm/arch/x86/kvm/vmx/posted_intr.c
> @@ -203,6 +203,25 @@ void pi_post_block(struct kvm_vcpu *vcpu
>  	local_irq_enable();
>  }
>  
> +int vmx_vcpu_check_block(struct kvm_vcpu *vcpu)
> +{
> +	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> +
> +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> +		return 0;
> +
> +	if (!kvm_vcpu_apicv_active(vcpu))
> +		return 0;
> +
> +	if (!kvm_arch_has_assigned_device(vcpu->kvm))
> +		return 0;
> +
> +	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR)
> +		return 0;
> +
> +	return 1;

IIUC, the logic is to bail out of the block loop if the VM has an assigned
device, but the blocking vCPU didn't reconfigure the PI.NV to the wakeup vector,
i.e. the assigned device came along after the initial check in vcpu_block().
That makes sense, but can you add a comment somewhere in/above this function?

> +}
> +
>  /*
>   * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
>   */
> @@ -236,6 +255,26 @@ bool pi_has_pending_interrupt(struct kvm
>  		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
>  }
>  
> +void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> +		return;
> +
> +	/* only care about first device assignment */
> +	if (device_count != 1)
> +		return;
> +
> +	/* Update wakeup vector and add vcpu to blocked_vcpu_list */

Can you expand this comment, too?  Specifically, I think what you're saying is
that the wakeup will cause the vCPU to bail out of kvm_vcpu_block() and go back
through vcpu_block() and thus pi_pre_block().

> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_vcpu_apicv_active(vcpu))
> +			continue;
> +
> +		kvm_vcpu_kick(vcpu);

Actually, can't we avoid the full kick and instead just do kvm_vcpu_wake_up()?
If the vCPU is in guest mode, i.e. kvm_arch_vcpu_should_kick() returns true,
then by definition it can't be blocking.  And if it is about to block, it's
guaranteed to see the assigned device.

> +	}
> +}
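
For reference, the check discussed above can be restated as a standalone
predicate with the kind of comment being requested. This is a sketch only:
struct check_block_inputs and model_should_bail_out() are hypothetical names,
and the booleans stand in for the kernel helpers used in the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state the check looks at. */
struct check_block_inputs {
	bool irq_posting_cap;		/* IOMMU is able to post interrupts */
	bool apicv_active;		/* vCPU has APICv active */
	bool has_assigned_device;	/* VM has at least one assigned device */
	bool nv_is_wakeup_vector;	/* PI.NV already is the wakeup vector */
};

/*
 * Returns true when a blocking vCPU should bail out of the block loop:
 * the VM has an assigned device, but this vCPU started blocking before
 * the device showed up and so never switched PI.NV to the wakeup vector.
 */
bool model_should_bail_out(const struct check_block_inputs *in)
{
	if (!in->irq_posting_cap || !in->apicv_active)
		return false;

	if (!in->has_assigned_device)
		return false;

	return !in->nv_is_wakeup_vector;
}
```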
Peter Xu May 7, 2021, 7:29 p.m. UTC | #2
On Fri, May 07, 2021 at 05:22:07PM +0000, Sean Christopherson wrote:
> On Fri, May 07, 2021, Marcelo Tosatti wrote:
> > Index: kvm/arch/x86/kvm/vmx/posted_intr.c
> > ===================================================================
> > --- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
> > +++ kvm/arch/x86/kvm/vmx/posted_intr.c
> > @@ -203,6 +203,25 @@ void pi_post_block(struct kvm_vcpu *vcpu
> >  	local_irq_enable();
> >  }
> >  
> > +int vmx_vcpu_check_block(struct kvm_vcpu *vcpu)
> > +{
> > +	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > +
> > +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> > +		return 0;
> > +
> > +	if (!kvm_vcpu_apicv_active(vcpu))
> > +		return 0;
> > +
> > +	if (!kvm_arch_has_assigned_device(vcpu->kvm))
> > +		return 0;
> > +
> > +	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR)
> > +		return 0;
> > +
> > +	return 1;
> 
> IIUC, the logic is to bail out of the block loop if the VM has an assigned
> device, but the blocking vCPU didn't reconfigure the PI.NV to the wakeup vector,
> i.e. the assigned device came along after the initial check in vcpu_block().
> That makes sense, but you can add a comment somewhere in/above this function?

Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
somehow, so that even without customized ->vcpu_check_block we should be able
to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
Marcelo Tosatti May 7, 2021, 10:08 p.m. UTC | #3
On Fri, May 07, 2021 at 03:29:05PM -0400, Peter Xu wrote:
> On Fri, May 07, 2021 at 05:22:07PM +0000, Sean Christopherson wrote:
> > On Fri, May 07, 2021, Marcelo Tosatti wrote:
> > > Index: kvm/arch/x86/kvm/vmx/posted_intr.c
> > > ===================================================================
> > > --- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
> > > +++ kvm/arch/x86/kvm/vmx/posted_intr.c
> > > @@ -203,6 +203,25 @@ void pi_post_block(struct kvm_vcpu *vcpu
> > >  	local_irq_enable();
> > >  }
> > >  
> > > +int vmx_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > +{
> > > +	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > +
> > > +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> > > +		return 0;
> > > +
> > > +	if (!kvm_vcpu_apicv_active(vcpu))
> > > +		return 0;
> > > +
> > > +	if (!kvm_arch_has_assigned_device(vcpu->kvm))
> > > +		return 0;
> > > +
> > > +	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR)
> > > +		return 0;
> > > +
> > > +	return 1;
> > 
> > IIUC, the logic is to bail out of the block loop if the VM has an assigned
> > device, but the blocking vCPU didn't reconfigure the PI.NV to the wakeup vector,
> > i.e. the assigned device came along after the initial check in vcpu_block().
> > That makes sense, but you can add a comment somewhere in/above this function?
> 
> Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> somehow, so that even without customized ->vcpu_check_block we should be able
> to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?

static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
{
        int ret = -EINTR;
        int idx = srcu_read_lock(&vcpu->kvm->srcu);

        if (kvm_arch_vcpu_runnable(vcpu)) {
                kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
                goto out;
        }

Don't want to unhalt the vcpu.
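
A minimal model of the distinction, assuming a separate arch hook is consulted
after the runnable check (the hook placement and all model_* names are
assumptions for illustration, not taken from the patch): both paths leave the
block loop, but only the runnable path raises the unhalt request.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request bit, for illustration only. */
enum { MODEL_REQ_UNHALT = 1u << 0 };

struct vcpu_model {
	unsigned int requests;
	bool runnable;			/* an event that ends HLT is pending */
	bool arch_wants_unblock;	/* e.g. PI.NV must be reprogrammed */
};

/* Returns 0 to keep blocking, -EINTR to leave the block loop. */
int model_check_block(struct vcpu_model *v)
{
	if (v->runnable) {
		/* The guest-visible halt state ends here. */
		v->requests |= MODEL_REQ_UNHALT;
		return -EINTR;
	}

	/* Leave the loop, but the vcpu stays halted from the guest's view. */
	if (v->arch_wants_unblock)
		return -EINTR;

	return 0;
}
```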
Peter Xu May 11, 2021, 2:39 p.m. UTC | #4
On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > somehow, so that even without customized ->vcpu_check_block we should be able
> > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> 
> static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> {
>         int ret = -EINTR;
>         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> 
>         if (kvm_arch_vcpu_runnable(vcpu)) {
>                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
>                 goto out;
>         }
> 
> Don't want to unhalt the vcpu.

Could you elaborate?  It's not obvious to me why we can't do that if
pi_test_on() returns true..  we have pending posted interrupts anyway, so
shouldn't we stop halting?  Thanks!
Marcelo Tosatti May 11, 2021, 2:51 p.m. UTC | #5
On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > 
> > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > {
> >         int ret = -EINTR;
> >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > 
> >         if (kvm_arch_vcpu_runnable(vcpu)) {
> >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> >                 goto out;
> >         }
> > 
> > Don't want to unhalt the vcpu.
> 
> Could you elaborate?  It's not obvious to me why we can't do that if
> pi_test_on() returns true..  we have pending post interrupts anyways, so
> shouldn't we stop halting?  Thanks!

pi_test_on() only returns true when an interrupt is signalled by the
device. But the sequence of events is:


1. pCPU idles without notification vector configured to wakeup vector.

2. PCI device is hotplugged, assigned device count increases from 0 to 1.

<arbitrary amount of time>

3. device generates interrupt, sets ON bit to true in the posted
interrupt descriptor.

We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
is not set).
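
To make the window concrete, here is a small sketch of the descriptor state
after each step of the sequence above. The names (struct pi_snapshot and the
model_* helpers) are hypothetical, and the model assumes the vcpu never
reprograms its notification vector during these steps.

```c
#include <assert.h>
#include <stdbool.h>

/* The numbered steps from the sequence above. */
enum step {
	STEP_HALTED = 1,
	STEP_DEVICE_ASSIGNED = 2,
	STEP_DEVICE_INTERRUPT = 3,
};

struct pi_snapshot {
	bool has_assigned_device;
	bool on;		/* ON bit in the PI descriptor */
	bool nv_is_wakeup;	/* never set in this scenario */
};

/* State of the blocked vcpu right after step `s` has happened. */
struct pi_snapshot model_state_after(enum step s)
{
	struct pi_snapshot st = { false, false, false };

	if (s >= STEP_DEVICE_ASSIGNED)
		st.has_assigned_device = true;
	if (s >= STEP_DEVICE_INTERRUPT)
		st.on = true;
	return st;
}

/* A check based on the ON bit only fires once step 3 has happened. */
bool model_on_bit_says_unblock(struct pi_snapshot st)
{
	return st.on;
}

/* A check based on the assigned device and NV already fires after step 2. */
bool model_nv_check_says_unblock(struct pi_snapshot st)
{
	return st.has_assigned_device && !st.nv_is_wakeup;
}
```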
Peter Xu May 11, 2021, 4:19 p.m. UTC | #6
On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > 
> > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > {
> > >         int ret = -EINTR;
> > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > 
> > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > >                 goto out;
> > >         }
> > > 
> > > Don't want to unhalt the vcpu.
> > 
> > Could you elaborate?  It's not obvious to me why we can't do that if
> > pi_test_on() returns true..  we have pending post interrupts anyways, so
> > shouldn't we stop halting?  Thanks!
> 
> pi_test_on() only returns true when an interrupt is signalled by the
> device. But the sequence of events is:
> 
> 
> 1. pCPU idles without notification vector configured to wakeup vector.
> 
> 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> 
> <arbitrary amount of time>
> 
> 3. device generates interrupt, sets ON bit to true in the posted
> interrupt descriptor.
> 
> We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> is not set).

Ah yes.. thanks.

Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
define a KVM_REQ_UNBLOCK to replace the vcpu_check_block hook (in x86's kvm_host.h):

#define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)

We can set it in vmx_pi_start_assignment(), then check+clear it in
kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).

The thing is, the current vmx_vcpu_check_block() is mostly a sanity check and
a copy-paste of the pi checks on a few items, so it may be cleaner to use
KVM_REQ_UNBLOCK, as it might be reused in the future for re-evaluating
pre-block for a similar purpose?

No strong opinion, though.
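
The set-then-check+clear pattern suggested above can be sketched with a plain
bitmask. This is a single-threaded illustration with hypothetical names and a
made-up bit number; the kernel's request helpers use atomic bit operations and
memory barriers that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit number, chosen only for this sketch. */
enum { MODEL_REQ_UNBLOCK = 5 };

struct vcpu_model {
	unsigned long requests;
};

/* Producer side, e.g. run when the first device gets assigned. */
void model_make_request(struct vcpu_model *v, unsigned int req)
{
	v->requests |= 1UL << req;
}

/* Consumer side: report whether the request was pending and clear it. */
bool model_check_request(struct vcpu_model *v, unsigned int req)
{
	unsigned long bit = 1UL << req;

	if (!(v->requests & bit))
		return false;

	v->requests &= ~bit;
	return true;
}
```

Because the check also clears the bit, the request is consumed by the first
caller that observes it.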
Marcelo Tosatti May 11, 2021, 5:18 p.m. UTC | #7
On Tue, May 11, 2021 at 12:19:56PM -0400, Peter Xu wrote:
> On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> > On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > > 
> > > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > > {
> > > >         int ret = -EINTR;
> > > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > 
> > > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > > >                 goto out;
> > > >         }
> > > > 
> > > > Don't want to unhalt the vcpu.
> > > 
> > > Could you elaborate?  It's not obvious to me why we can't do that if
> > > pi_test_on() returns true..  we have pending post interrupts anyways, so
> > > shouldn't we stop halting?  Thanks!
> > 
> > pi_test_on() only returns true when an interrupt is signalled by the
> > device. But the sequence of events is:
> > 
> > 
> > 1. pCPU idles without notification vector configured to wakeup vector.
> > 
> > 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> > 
> > <arbitrary amount of time>
> > 
> > 3. device generates interrupt, sets ON bit to true in the posted
> > interrupt descriptor.
> > 
> > We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> > is not set).
> 
> Ah yes.. thanks.
> 
> Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
> define a KVM_REQ_UNBLOCK to replace the pre_block hook (in x86's kvm_host.h):
> 
> #define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
> 
> We can set it in vmx_pi_start_assignment(), then check+clear it in
> kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).

Can't check it in kvm_vcpu_has_events() because that will set
KVM_REQ_UNHALT (which we don't want).

I think KVM_REQ_UNBLOCK will add more lines of code.

> The thing is current vmx_vcpu_check_block() is mostly a sanity check and
> copy-paste of the pi checks on a few items, so maybe cleaner to use
> KVM_REQ_UNBLOCK, as it might be reused in the future for re-evaluating of
> pre-block for similar purpose?
> 
> No strong opinion, though.

Hum... IMHO v3 is quite clean already (although I don't object to your
suggestion).

Paolo, what do you think?
Peter Xu May 11, 2021, 9:35 p.m. UTC | #8
On Tue, May 11, 2021 at 02:18:10PM -0300, Marcelo Tosatti wrote:
> On Tue, May 11, 2021 at 12:19:56PM -0400, Peter Xu wrote:
> > On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> > > On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > > > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > > > 
> > > > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > > > {
> > > > >         int ret = -EINTR;
> > > > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > 
> > > > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > > > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > > > >                 goto out;
> > > > >         }
> > > > > 
> > > > > Don't want to unhalt the vcpu.
> > > > 
> > > > Could you elaborate?  It's not obvious to me why we can't do that if
> > > > pi_test_on() returns true..  we have pending post interrupts anyways, so
> > > > shouldn't we stop halting?  Thanks!
> > > 
> > > pi_test_on() only returns true when an interrupt is signalled by the
> > > device. But the sequence of events is:
> > > 
> > > 
> > > 1. pCPU idles without notification vector configured to wakeup vector.
> > > 
> > > 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> > > 
> > > <arbitrary amount of time>
> > > 
> > > 3. device generates interrupt, sets ON bit to true in the posted
> > > interrupt descriptor.
> > > 
> > > We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> > > is not set).
> > 
> > Ah yes.. thanks.
> > 
> > Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
> > define a KVM_REQ_UNBLOCK to replace the pre_block hook (in x86's kvm_host.h):
> > 
> > #define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
> > 
> > We can set it in vmx_pi_start_assignment(), then check+clear it in
> > kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).
> 
> Can't check it in kvm_vcpu_has_events() because that will set
> KVM_REQ_UNHALT (which we don't want).

I thought it was okay to break the guest HLT? As IMHO the guest code should
always be able to re-run the HLT when interrupted?  As IIUC HLT can easily be
interrupted by e.g., SMIs, according to SDM Vol.2.  Not to mention vfio hotplug
should be rare, and we'll only trigger this once for the 1st device.

> 
> I think KVM_REQ_UNBLOCK will add more lines of code.

It's very possible I overlooked something above... but if breaking HLT
irregularly is okay, I attached one patch that is based on your v3 series; it just
drops vcpu_check_block() and uses KVM_REQ_UNBLOCK instead (not even compile
tested, just to satisfy my own curiosity about how many LOC we can save.. :). It gives me:

 7 files changed, 5 insertions(+), 41 deletions(-)

But again, I could have missed something...

Thanks,

> 
> > The thing is current vmx_vcpu_check_block() is mostly a sanity check and
> > copy-paste of the pi checks on a few items, so maybe cleaner to use
> > KVM_REQ_UNBLOCK, as it might be reused in the future for re-evaluating of
> > pre-block for similar purpose?
> > 
> > No strong opinion, though.
> 
> Hum... IMHO v3 is quite clean already (although i don't object to your
> suggestion).
> 
> Paolo, what do you think?
> 
> 
>
Marcelo Tosatti May 11, 2021, 11:51 p.m. UTC | #9
On Tue, May 11, 2021 at 05:35:41PM -0400, Peter Xu wrote:
> On Tue, May 11, 2021 at 02:18:10PM -0300, Marcelo Tosatti wrote:
> > On Tue, May 11, 2021 at 12:19:56PM -0400, Peter Xu wrote:
> > > On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> > > > On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > > > > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > > > > 
> > > > > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > > > > {
> > > > > >         int ret = -EINTR;
> > > > > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > > 
> > > > > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > > > > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > > > > >                 goto out;
> > > > > >         }
> > > > > > 
> > > > > > Don't want to unhalt the vcpu.
> > > > > 
> > > > > Could you elaborate?  It's not obvious to me why we can't do that if
> > > > > pi_test_on() returns true..  we have pending post interrupts anyways, so
> > > > > shouldn't we stop halting?  Thanks!
> > > > 
> > > > pi_test_on() only returns true when an interrupt is signalled by the
> > > > device. But the sequence of events is:
> > > > 
> > > > 
> > > > 1. pCPU idles without notification vector configured to wakeup vector.
> > > > 
> > > > 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> > > > 
> > > > <arbitrary amount of time>
> > > > 
> > > > 3. device generates interrupt, sets ON bit to true in the posted
> > > > interrupt descriptor.
> > > > 
> > > > We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> > > > is not set).
> > > 
> > > Ah yes.. thanks.
> > > 
> > > Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
> > > define a KVM_REQ_UNBLOCK to replace the pre_block hook (in x86's kvm_host.h):
> > > 
> > > #define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
> > > 
> > > We can set it in vmx_pi_start_assignment(), then check+clear it in
> > > kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).
> > 
> > Can't check it in kvm_vcpu_has_events() because that will set
> > KVM_REQ_UNHALT (which we don't want).
> 
> I thought it was okay to break the guest HLT? 

Intel:

"HLT-HALT

Description

Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and
SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution. If an
interrupt (including NMI) is used to resume execution after a HLT instruction, the saved instruction pointer
(CS:EIP) points to the instruction following the HLT instruction."

AMD:

"6.5 Processor Halt
The processor halt instruction (HLT) halts instruction execution, leaving the processor in the halt state.
No registers or machine state are modified as a result of executing the HLT instruction. The processor
remains in the halt state until one of the following occurs:
• A non-maskable interrupt (NMI).
• An enabled, maskable interrupt (INTR).
• Processor reset (RESET).
• Processor initialization (INIT).
• System-management interrupt (SMI)."

The KVM_REQ_UNBLOCK patch will resume execution even any such event
occurring. So the behaviour would be different from baremetal.
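
Put as code, the two quotes amount to a short list of events that end the halt
state. The sketch below encodes only the events common to both quotes; the
enum and function names are hypothetical. A request-driven unblock with no
such event pending corresponds to HALT_EV_NONE.

```c
#include <assert.h>
#include <stdbool.h>

/* Events common to the Intel and AMD descriptions quoted above. */
enum halt_event {
	HALT_EV_NONE,		/* nothing pending */
	HALT_EV_INTR_ENABLED,	/* enabled, maskable interrupt */
	HALT_EV_NMI,
	HALT_EV_SMI,
	HALT_EV_INIT,
	HALT_EV_RESET,
};

/* True if, per the quoted text, the event takes the CPU out of HALT. */
bool model_halt_ends(enum halt_event ev)
{
	switch (ev) {
	case HALT_EV_INTR_ENABLED:
	case HALT_EV_NMI:
	case HALT_EV_SMI:
	case HALT_EV_INIT:
	case HALT_EV_RESET:
		return true;
	case HALT_EV_NONE:
	default:
		return false;
	}
}
```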

> As IMHO the guest code should
> always be able to re-run the HLT when interrupted?  As IIUC HLT can easily be
> interrupted by e.g., SMIs, according to SDM Vol.2.  

CPU will by default return to HLT'ed state, not continue to the
instruction following HLT, on SMI:

34.10 AUTO HALT RESTART
If the processor is in a HALT state (due to the prior execution of a HLT instruction) when it receives an SMI, the
processor records the fact in the auto HALT restart flag in the saved processor state (see Figure 34-3). (This flag is
located at offset 7F02H and bit 0 in the state save area of the SMRAM.)
If the processor sets the auto HALT restart flag upon entering SMM (indicating that the SMI occurred when the
processor was in the HALT state), the SMI handler has two options:
* It can leave the auto HALT restart flag set, which instructs the RSM instruction to return program control to the
HLT instruction. This option in effect causes the processor to re-enter the HALT state after handling the SMI.
(This is the default operation.)
* It can clear the auto HALT restart flag, which instructs the RSM instruction to return program control to the
instruction following the HLT instruction.
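
The two options can be sketched as follows. The names are hypothetical, the
16-bit width of the flag word is an assumption of this sketch, and the model
only covers an SMI that arrived while the processor was in the HALT state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the auto HALT restart flag, as in the quoted text. */
#define MODEL_AUTO_HALT_RESTART 0x0001u

enum rsm_target {
	RSM_TO_HLT,		/* re-enter the HALT state */
	RSM_TO_INSN_AFTER_HLT,	/* continue after the HLT instruction */
};

/* Flag value the processor records when entering SMM. */
uint16_t model_smm_entry_flag(bool was_halted)
{
	return was_halted ? MODEL_AUTO_HALT_RESTART : 0;
}

/* Where RSM returns control, for an SMI taken in the HALT state. */
enum rsm_target model_rsm_target(uint16_t flag)
{
	return (flag & MODEL_AUTO_HALT_RESTART) ? RSM_TO_HLT
						: RSM_TO_INSN_AFTER_HLT;
}
```

With the flag left untouched by the handler (the default), the model returns
RSM_TO_HLT, i.e. the SMI does not by itself move execution past the HLT.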

> Not to mention vfio hotplug
> should be rare, and we'll only trigger this once for the 1st device.
> 
> > 
> > I think KVM_REQ_UNBLOCK will add more lines of code.
> 
> It's very possible I overlooked something above... but if breaking HLT
> unregularly is okay, I attached one patch that is based on your v3 series, just
> dropped the vcpu_check_block() but use KVM_REQ_UNBLOCK (no compile test even,
> just to satisfy my own curiosity on how many loc we can save.. :), it gives me:
> 
>  7 files changed, 5 insertions(+), 41 deletions(-)
> 
> But again, I could have missed something...
> 
> Thanks,
> 
> > 
> > > The thing is current vmx_vcpu_check_block() is mostly a sanity check and
> > > copy-paste of the pi checks on a few items, so maybe cleaner to use
> > > KVM_REQ_UNBLOCK, as it might be reused in the future for re-evaluating of
> > > pre-block for similar purpose?
> > > 
> > > No strong opinion, though.
> > 
> > Hum... IMHO v3 is quite clean already (although i don't object to your
> > suggestion).
> > 
> > Paolo, what do you think?
> > 
> > 
> > 
> 
> -- 
> Peter Xu

> From 1131248f3c8f1f2715dd49d439c9fab25b4db9b8 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 11 May 2021 17:33:21 -0400
> Subject: [PATCH] replace vcpu_check_block() hook with KVM_REQ_UNBLOCK
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 -
>  arch/x86/include/asm/kvm_host.h    | 12 +-----------
>  arch/x86/kvm/svm/svm.c             |  1 -
>  arch/x86/kvm/vmx/posted_intr.c     | 27 +--------------------------
>  arch/x86/kvm/vmx/posted_intr.h     |  1 -
>  arch/x86/kvm/vmx/vmx.c             |  1 -
>  arch/x86/kvm/x86.c                 |  3 +++
>  7 files changed, 5 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index fc99fb779fd21..e7bef91cee04a 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -98,7 +98,6 @@ KVM_X86_OP_NULL(pre_block)
>  KVM_X86_OP_NULL(post_block)
>  KVM_X86_OP_NULL(vcpu_blocking)
>  KVM_X86_OP_NULL(vcpu_unblocking)
> -KVM_X86_OP_NULL(vcpu_check_block)
>  KVM_X86_OP_NULL(update_pi_irte)
>  KVM_X86_OP_NULL(start_assignment)
>  KVM_X86_OP_NULL(apicv_post_state_restore)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5bf7bd0e59582..74ab042e9b146 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -91,6 +91,7 @@
>  #define KVM_REQ_MSR_FILTER_CHANGED	KVM_ARCH_REQ(29)
>  #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
>  	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -1350,8 +1351,6 @@ struct kvm_x86_ops {
>  	void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
>  	void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
>  
> -	int (*vcpu_check_block)(struct kvm_vcpu *vcpu);
> -
>  	int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
>  			      uint32_t guest_irq, bool set);
>  	void (*start_assignment)(struct kvm *kvm);
> @@ -1835,15 +1834,6 @@ static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq)
>  		irq->delivery_mode == APIC_DM_LOWEST);
>  }
>  
> -#define __KVM_HAVE_ARCH_VCPU_CHECK_BLOCK
> -static inline int kvm_arch_vcpu_check_block(struct kvm_vcpu *vcpu)
> -{
> -	if (kvm_x86_ops.vcpu_check_block)
> -		return static_call(kvm_x86_vcpu_check_block)(vcpu);
> -
> -	return 0;
> -}
> -
>  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
>  {
>  	static_call_cond(kvm_x86_vcpu_blocking)(vcpu);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index cda5ccb4d9d1b..8b03795cfcd11 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4459,7 +4459,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.vcpu_put = svm_vcpu_put,
>  	.vcpu_blocking = svm_vcpu_blocking,
>  	.vcpu_unblocking = svm_vcpu_unblocking,
> -	.vcpu_check_block = NULL,
>  
>  	.update_exception_bitmap = svm_update_exception_bitmap,
>  	.get_msr_feature = svm_get_msr_feature,
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 2d0d009965530..0b74d598ebcbd 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -203,32 +203,6 @@ void pi_post_block(struct kvm_vcpu *vcpu)
>  	local_irq_enable();
>  }
>  
> -/*
> - * Bail out of the block loop if the VM has an assigned
> - * device, but the blocking vCPU didn't reconfigure the
> - * PI.NV to the wakeup vector, i.e. the assigned device
> - * came along after the initial check in vcpu_block().
> - */
> -
> -int vmx_vcpu_check_block(struct kvm_vcpu *vcpu)
> -{
> -	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> -
> -	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> -		return 0;
> -
> -	if (!kvm_vcpu_apicv_active(vcpu))
> -		return 0;
> -
> -	if (!kvm_arch_has_assigned_device(vcpu->kvm))
> -		return 0;
> -
> -	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR)
> -		return 0;
> -
> -	return 1;
> -}
> -
>  /*
>   * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
>   */
> @@ -278,6 +252,7 @@ void vmx_pi_start_assignment(struct kvm *kvm)
>  		if (!kvm_vcpu_apicv_active(vcpu))
>  			continue;
>  
> +		kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
>  		kvm_vcpu_wake_up(vcpu);
>  	}
>  }
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index 2aa082fd1c7ab..7f7b2326caf53 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -96,6 +96,5 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
>  int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
>  		   bool set);
>  void vmx_pi_start_assignment(struct kvm *kvm);
> -int vmx_vcpu_check_block(struct kvm_vcpu *vcpu);
>  
>  #endif /* __KVM_X86_VMX_POSTED_INTR_H */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ab68fed8b7e43..639ec3eba9b80 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7716,7 +7716,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  
>  	.pre_block = vmx_pre_block,
>  	.post_block = vmx_post_block,
> -	.vcpu_check_block = vmx_vcpu_check_block,
>  
>  	.pmu_ops = &intel_pmu_ops,
>  	.nested_ops = &vmx_nested_ops,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e6fee59b5dab6..739e1bd59e8a9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11177,6 +11177,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>  	     static_call(kvm_x86_smi_allowed)(vcpu, false)))
>  		return true;
>  
> +	if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
> +		return true;
> +
>  	if (kvm_arch_interrupt_allowed(vcpu) &&
>  	    (kvm_cpu_has_interrupt(vcpu) ||
>  	    kvm_guest_apic_has_interrupt(vcpu)))
> -- 
> 2.31.1
>
Marcelo Tosatti May 12, 2021, 12:02 a.m. UTC | #10
On Tue, May 11, 2021 at 08:51:24PM -0300, Marcelo Tosatti wrote:
> On Tue, May 11, 2021 at 05:35:41PM -0400, Peter Xu wrote:
> > On Tue, May 11, 2021 at 02:18:10PM -0300, Marcelo Tosatti wrote:
> > > On Tue, May 11, 2021 at 12:19:56PM -0400, Peter Xu wrote:
> > > > On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> > > > > On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > > > > > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > > > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > > > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > > > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > > > > > 
> > > > > > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > > > > > {
> > > > > > >         int ret = -EINTR;
> > > > > > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > > > 
> > > > > > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > > > > > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > > > > > >                 goto out;
> > > > > > >         }
> > > > > > > 
> > > > > > > Don't want to unhalt the vcpu.
> > > > > > 
> > > > > > Could you elaborate?  It's not obvious to me why we can't do that if
> > > > > > pi_test_on() returns true..  we have pending post interrupts anyways, so
> > > > > > shouldn't we stop halting?  Thanks!
> > > > > 
> > > > > pi_test_on() only returns true when an interrupt is signalled by the
> > > > > device. But the sequence of events is:
> > > > > 
> > > > > 
> > > > > 1. pCPU idles without notification vector configured to wakeup vector.
> > > > > 
> > > > > 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> > > > > 
> > > > > <arbitrary amount of time>
> > > > > 
> > > > > 3. device generates interrupt, sets ON bit to true in the posted
> > > > > interrupt descriptor.
> > > > > 
> > > > > We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> > > > > is not set).
> > > > 
> > > > Ah yes.. thanks.
> > > > 
> > > > Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
> > > > define a KVM_REQ_UNBLOCK to replace the pre_block hook (in x86's kvm_host.h):
> > > > 
> > > > #define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
> > > > 
> > > > We can set it in vmx_pi_start_assignment(), then check+clear it in
> > > > kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).
> > > 
> > > Can't check it in kvm_vcpu_has_events() because that will set
> > > KVM_REQ_UNHALT (which we don't want).
> > 
> > I thought it was okay to break the guest HLT? 
> 
> Intel:
> 
> "HLT-HALT
> 
> Description
> 
> Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and
> SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution. If an
> interrupt (including NMI) is used to resume execution after a HLT instruction, the saved instruction pointer
> (CS:EIP) points to the instruction following the HLT instruction."
> 
> AMD:
> 
> "6.5 Processor Halt
> The processor halt instruction (HLT) halts instruction execution, leaving the processor in the halt state.
> No registers or machine state are modified as a result of executing the HLT instruction. The processor
> remains in the halt state until one of the following occurs:
> • A non-maskable interrupt (NMI).
> • An enabled, maskable interrupt (INTR).
> • Processor reset (RESET).
> • Processor initialization (INIT).
> • System-management interrupt (SMI)."
> 
> The KVM_REQ_UNBLOCK patch will resume execution even any such event

						  even without any such event

> occurring. So the behaviour would be different from baremetal.
Peter Xu May 12, 2021, 12:38 a.m. UTC | #11
On Tue, May 11, 2021 at 09:02:59PM -0300, Marcelo Tosatti wrote:
> On Tue, May 11, 2021 at 08:51:24PM -0300, Marcelo Tosatti wrote:
> > On Tue, May 11, 2021 at 05:35:41PM -0400, Peter Xu wrote:
> > > On Tue, May 11, 2021 at 02:18:10PM -0300, Marcelo Tosatti wrote:
> > > > On Tue, May 11, 2021 at 12:19:56PM -0400, Peter Xu wrote:
> > > > > On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> > > > > > On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > > > > > > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > > > > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > > > > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > > > > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > > > > > > 
> > > > > > > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > > > > > > {
> > > > > > > >         int ret = -EINTR;
> > > > > > > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > > > > 
> > > > > > > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > > > > > > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > > > > > > >                 goto out;
> > > > > > > >         }
> > > > > > > > 
> > > > > > > > Don't want to unhalt the vcpu.
> > > > > > > 
> > > > > > > Could you elaborate?  It's not obvious to me why we can't do that if
> > > > > > > > pi_test_on() returns true..  we have pending posted interrupts anyway, so
> > > > > > > shouldn't we stop halting?  Thanks!
> > > > > > 
> > > > > > pi_test_on() only returns true when an interrupt is signalled by the
> > > > > > device. But the sequence of events is:
> > > > > > 
> > > > > > 
> > > > > > 1. pCPU idles without notification vector configured to wakeup vector.
> > > > > > 
> > > > > > 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> > > > > > 
> > > > > > <arbitrary amount of time>
> > > > > > 
> > > > > > 3. device generates interrupt, sets ON bit to true in the posted
> > > > > > interrupt descriptor.
> > > > > > 
> > > > > > We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> > > > > > is not set).
> > > > > 
> > > > > Ah yes.. thanks.
> > > > > 
> > > > > Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
> > > > > define a KVM_REQ_UNBLOCK to replace the pre_block hook (in x86's kvm_host.h):
> > > > > 
> > > > > #define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
> > > > > 
> > > > > We can set it in vmx_pi_start_assignment(), then check+clear it in
> > > > > kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).
> > > > 
> > > > Can't check it in kvm_vcpu_has_events() because that will set
> > > > KVM_REQ_UNHALT (which we don't want).
> > > 
> > > I thought it was okay to break the guest HLT? 
> > 
> > Intel:
> > 
> > "HLT-HALT
> > 
> > Description
> > 
> > Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and
> > SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution. If an
> > interrupt (including NMI) is used to resume execution after a HLT instruction, the saved instruction pointer
> > (CS:EIP) points to the instruction following the HLT instruction."
> > 
> > AMD:
> > 
> > "6.5 Processor Halt
> > The processor halt instruction (HLT) halts instruction execution, leaving the processor in the halt state.
> > No registers or machine state are modified as a result of executing the HLT instruction. The processor
> > remains in the halt state until one of the following occurs:
> > • A non-maskable interrupt (NMI).
> > • An enabled, maskable interrupt (INTR).
> > • Processor reset (RESET).
> > • Processor initialization (INIT).
> > • System-management interrupt (SMI)."
> > 
> > The KVM_REQ_UNBLOCK patch will resume execution even any such event
> 
> 						  even without any such event
> 
> > occurring. So the behaviour would be different from baremetal.
> 

What if we move that kvm_check_request() into kvm_vcpu_check_block()?

---8<---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 739e1bd59e8a9..e6fee59b5dab6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11177,9 +11177,6 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
             static_call(kvm_x86_smi_allowed)(vcpu, false)))
                return true;
 
-       if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
-               return true;
-
        if (kvm_arch_interrupt_allowed(vcpu) &&
            (kvm_cpu_has_interrupt(vcpu) ||
            kvm_guest_apic_has_interrupt(vcpu)))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f68035355c08a..fc5f6bffff7fc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2925,6 +2925,10 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
                kvm_make_request(KVM_REQ_UNHALT, vcpu);
                goto out;
        }
+#ifdef CONFIG_X86
+       if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
+               goto out;
+#endif
        if (kvm_cpu_has_pending_timer(vcpu))
                goto out;
        if (signal_pending(current))
---8<---

(The CONFIG_X86 is ugly indeed.. but just to show what I meant, e.g. it can be
 a boolean too I think)

Would this work?

Thanks,
Marcelo Tosatti May 12, 2021, 11:10 a.m. UTC | #12
On Tue, May 11, 2021 at 08:38:16PM -0400, Peter Xu wrote:
> On Tue, May 11, 2021 at 09:02:59PM -0300, Marcelo Tosatti wrote:
> > On Tue, May 11, 2021 at 08:51:24PM -0300, Marcelo Tosatti wrote:
> > > On Tue, May 11, 2021 at 05:35:41PM -0400, Peter Xu wrote:
> > > > On Tue, May 11, 2021 at 02:18:10PM -0300, Marcelo Tosatti wrote:
> > > > > On Tue, May 11, 2021 at 12:19:56PM -0400, Peter Xu wrote:
> > > > > > On Tue, May 11, 2021 at 11:51:57AM -0300, Marcelo Tosatti wrote:
> > > > > > > On Tue, May 11, 2021 at 10:39:11AM -0400, Peter Xu wrote:
> > > > > > > > On Fri, May 07, 2021 at 07:08:31PM -0300, Marcelo Tosatti wrote:
> > > > > > > > > > Wondering whether we should add a pi_test_on() check in kvm_vcpu_has_events()
> > > > > > > > > > somehow, so that even without customized ->vcpu_check_block we should be able
> > > > > > > > > > to break the block loop (as kvm_arch_vcpu_runnable will return true properly)?
> > > > > > > > > 
> > > > > > > > > static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> > > > > > > > > {
> > > > > > > > >         int ret = -EINTR;
> > > > > > > > >         int idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > > > > > 
> > > > > > > > >         if (kvm_arch_vcpu_runnable(vcpu)) {
> > > > > > > > >                 kvm_make_request(KVM_REQ_UNHALT, vcpu); <---
> > > > > > > > >                 goto out;
> > > > > > > > >         }
> > > > > > > > > 
> > > > > > > > > Don't want to unhalt the vcpu.
> > > > > > > > 
> > > > > > > > Could you elaborate?  It's not obvious to me why we can't do that if
> > > > > > > > > pi_test_on() returns true..  we have pending posted interrupts anyway, so
> > > > > > > > shouldn't we stop halting?  Thanks!
> > > > > > > 
> > > > > > > pi_test_on() only returns true when an interrupt is signalled by the
> > > > > > > device. But the sequence of events is:
> > > > > > > 
> > > > > > > 
> > > > > > > 1. pCPU idles without notification vector configured to wakeup vector.
> > > > > > > 
> > > > > > > 2. PCI device is hotplugged, assigned device count increases from 0 to 1.
> > > > > > > 
> > > > > > > <arbitrary amount of time>
> > > > > > > 
> > > > > > > 3. device generates interrupt, sets ON bit to true in the posted
> > > > > > > interrupt descriptor.
> > > > > > > 
> > > > > > > We want to exit kvm_vcpu_block after 2, but before 3 (where ON bit
> > > > > > > is not set).
> > > > > > 
> > > > > > Ah yes.. thanks.
> > > > > > 
> > > > > > Besides the current approach, I'm thinking maybe it'll be cleaner/less LOC to
> > > > > > define a KVM_REQ_UNBLOCK to replace the pre_block hook (in x86's kvm_host.h):
> > > > > > 
> > > > > > #define KVM_REQ_UNBLOCK			KVM_ARCH_REQ(31)
> > > > > > 
> > > > > > We can set it in vmx_pi_start_assignment(), then check+clear it in
> > > > > > kvm_vcpu_has_events() (or make it a bool in kvm_vcpu struct?).
> > > > > 
> > > > > Can't check it in kvm_vcpu_has_events() because that will set
> > > > > KVM_REQ_UNHALT (which we don't want).
> > > > 
> > > > I thought it was okay to break the guest HLT? 
> > > 
> > > Intel:
> > > 
> > > "HLT-HALT
> > > 
> > > Description
> > > 
> > > Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and
> > > SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution. If an
> > > interrupt (including NMI) is used to resume execution after a HLT instruction, the saved instruction pointer
> > > (CS:EIP) points to the instruction following the HLT instruction."
> > > 
> > > AMD:
> > > 
> > > "6.5 Processor Halt
> > > The processor halt instruction (HLT) halts instruction execution, leaving the processor in the halt state.
> > > No registers or machine state are modified as a result of executing the HLT instruction. The processor
> > > remains in the halt state until one of the following occurs:
> > > • A non-maskable interrupt (NMI).
> > > • An enabled, maskable interrupt (INTR).
> > > • Processor reset (RESET).
> > > • Processor initialization (INIT).
> > > • System-management interrupt (SMI)."
> > > 
> > > The KVM_REQ_UNBLOCK patch will resume execution even any such event
> > 
> > 						  even without any such event
> > 
> > > occurring. So the behaviour would be different from baremetal.
> > 
> 
> What if we move that kvm_check_request() into kvm_vcpu_check_block()?
> 
> ---8<---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 739e1bd59e8a9..e6fee59b5dab6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11177,9 +11177,6 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>              static_call(kvm_x86_smi_allowed)(vcpu, false)))
>                 return true;
>  
> -       if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
> -               return true;
> -
>         if (kvm_arch_interrupt_allowed(vcpu) &&
>             (kvm_cpu_has_interrupt(vcpu) ||
>             kvm_guest_apic_has_interrupt(vcpu)))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f68035355c08a..fc5f6bffff7fc 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2925,6 +2925,10 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>                 kvm_make_request(KVM_REQ_UNHALT, vcpu);
>                 goto out;
>         }
> +#ifdef CONFIG_X86
> +       if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
> +               goto out;
> +#endif
>         if (kvm_cpu_has_pending_timer(vcpu))
>                 goto out;
>         if (signal_pending(current))
> ---8<---
> 
> (The CONFIG_X86 is ugly indeed.. but just to show what I meant, e.g. it can be
>  a boolean too I think)
> 
> Would this work?

That would work, but vcpu->requests are checked (and processed) at
vcpu_enter_guest, before guest entry. The proposed request would not
follow that pattern.
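
For readers following along, the "check+clear" request pattern being
discussed can be modelled in user space roughly as below. This is only a
sketch under stated assumptions: struct toy_reqs, toy_make_request(),
toy_check_request() and the bit number are made up for illustration and are
not the kernel's kvm_make_request()/kvm_check_request(), which operate on
vcpu->requests with atomic bit operations.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-vcpu request bitmap. */
struct toy_reqs {
	unsigned long bits;
};

/* Arbitrary bit number, chosen for illustration only. */
#define TOY_REQ_UNBLOCK 3u

/* Post a request: the bit stays set until some consumer checks it. */
static void toy_make_request(struct toy_reqs *r, unsigned int req)
{
	assert(req < sizeof(r->bits) * CHAR_BIT);
	r->bits |= 1UL << req;
}

/* Consume a request: returns true once per posted request, then clears it. */
static bool toy_check_request(struct toy_reqs *r, unsigned int req)
{
	unsigned long mask;

	assert(req < sizeof(r->bits) * CHAR_BIT);
	mask = 1UL << req;
	if (!(r->bits & mask))
		return false;
	r->bits &= ~mask;
	return true;
}
```

Because the check also clears the bit, whichever code path consumes the
request first "owns" it. That is why it matters whether the check sits in
the block loop or in the guest-entry path.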
Sean Christopherson May 12, 2021, 2:41 p.m. UTC | #13
On Tue, May 11, 2021, Marcelo Tosatti wrote:
> > The KVM_REQ_UNBLOCK patch will resume execution even any such event
> 
> 						  even without any such event
> 
> > occurring. So the behaviour would be different from baremetal.

I agree with Marcelo, we don't want to spuriously unhalt the vCPU.  It's legal,
albeit risky, to do something like

	hlt
	/* #UD to triple fault if this CPU is awakened. */
	ud2

when offlining a CPU, in which case the spurious wake event will crash the guest.
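
The distinction drawn here, between the host leaving its block loop and the
guest actually resuming past HLT, can be sketched as a small user-space
model. Everything below is hypothetical (struct toy_vcpu, the enum, and
toy_block_once() are illustrative names, not KVM code); it only assumes that
a wake event is what may legitimately unhalt the guest, while an
"unblock"-style request merely forces the host to re-evaluate and block
again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Possible results of one pass through a (modelled) block loop. */
enum toy_outcome {
	TOY_KEEP_BLOCKING,	/* nothing happened, stay asleep */
	TOY_REBLOCK,		/* host left the loop, guest is still halted */
	TOY_RESUME_GUEST,	/* real wake event, guest runs past HLT */
};

struct toy_vcpu {
	bool wake_event;	/* e.g. an enabled interrupt, NMI, SMI, INIT */
	bool unblock_req;	/* host-internal request, invisible to guest */
};

static enum toy_outcome toy_block_once(struct toy_vcpu *v)
{
	assert(v != NULL);

	/* Only an architectural wake event may end the guest's HLT. */
	if (v->wake_event)
		return TOY_RESUME_GUEST;

	/* Consume the host request without unhalting the guest. */
	if (v->unblock_req) {
		v->unblock_req = false;
		return TOY_REBLOCK;
	}

	return TOY_KEEP_BLOCKING;
}
```

In this model the hlt/ud2 sequence above is only ever passed when
wake_event is set; an unblock request alone never yields
TOY_RESUME_GUEST.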
Peter Xu May 12, 2021, 3:34 p.m. UTC | #14
On Wed, May 12, 2021 at 02:41:56PM +0000, Sean Christopherson wrote:
> On Tue, May 11, 2021, Marcelo Tosatti wrote:
> > > The KVM_REQ_UNBLOCK patch will resume execution even any such event
> > 
> > 						  even without any such event
> > 
> > > occurring. So the behaviour would be different from baremetal.
> 
> I agree with Marcelo, we don't want to spuriously unhalt the vCPU.  It's legal,
> albeit risky, to do something like
> 
> 	hlt
> 	/* #UD to triple fault if this CPU is awakened. */
> 	ud2
> 
> when offlining a CPU, in which case the spurious wake event will crash the guest.

We can avoid that by moving the check+clear of KVM_REQ_UNBLOCK from
kvm_vcpu_has_events() into kvm_vcpu_check_block(), as I suggested in the other
subthread.  But I also agree that Marcelo's series already fixes the bug,
so I have no strong opinion on this.

Thanks,

Patch

Index: kvm/arch/x86/kvm/vmx/posted_intr.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
+++ kvm/arch/x86/kvm/vmx/posted_intr.c
@@ -203,6 +203,25 @@  void pi_post_block(struct kvm_vcpu *vcpu
 	local_irq_enable();
 }
 
+int vmx_vcpu_check_block(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return 0;
+
+	if (!kvm_vcpu_apicv_active(vcpu))
+		return 0;
+
+	if (!kvm_arch_has_assigned_device(vcpu->kvm))
+		return 0;
+
+	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR)
+		return 0;
+
+	return 1;
+}
+
 /*
  * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
  */
@@ -236,6 +255,26 @@  bool pi_has_pending_interrupt(struct kvm
 		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
 }
 
+void vmx_pi_start_assignment(struct kvm *kvm, int device_count)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return;
+
+	/* only care about first device assignment */
+	if (device_count != 1)
+		return;
+
+	/* Kick vcpus so they reprogram the wakeup vector when they block again */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_vcpu_apicv_active(vcpu))
+			continue;
+
+		kvm_vcpu_kick(vcpu);
+	}
+}
 
 /*
  * pi_update_irte - set IRTE for Posted-Interrupts
Index: kvm/arch/x86/kvm/vmx/posted_intr.h
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.h
+++ kvm/arch/x86/kvm/vmx/posted_intr.h
@@ -95,5 +95,7 @@  void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 		   bool set);
+void vmx_pi_start_assignment(struct kvm *kvm, int device_count);
+int vmx_vcpu_check_block(struct kvm_vcpu *vcpu);
 
 #endif /* __KVM_X86_VMX_POSTED_INTR_H */
Index: kvm/arch/x86/kvm/vmx/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/vmx.c
+++ kvm/arch/x86/kvm/vmx/vmx.c
@@ -7727,13 +7727,13 @@  static struct kvm_x86_ops vmx_x86_ops __
 
 	.pre_block = vmx_pre_block,
 	.post_block = vmx_post_block,
-	.vcpu_check_block = NULL,
+	.vcpu_check_block = vmx_vcpu_check_block,
 
 	.pmu_ops = &intel_pmu_ops,
 	.nested_ops = &vmx_nested_ops,
 
 	.update_pi_irte = pi_update_irte,
-	.start_assignment = NULL,
+	.start_assignment = vmx_pi_start_assignment,
 
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,
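
As a reading aid, the decision made by vmx_vcpu_check_block() in the patch
above can be exercised in user space with a model like the following. It is
a sketch only: struct toy_state and its boolean fields are hypothetical
stand-ins for irq_remapping_cap(IRQ_POSTING_CAP), kvm_vcpu_apicv_active()
and kvm_arch_has_assigned_device(), the vector numbers are placeholders, and
it assumes (as the commit message implies) that a nonzero return makes the
caller leave the halt loop so the notification vector gets reprogrammed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder vector numbers; the real values live in kernel headers. */
enum {
	TOY_NOTIFY_VECTOR = 1,
	TOY_WAKEUP_VECTOR = 2,
};

/* Hypothetical snapshot of the state the patch's hook inspects. */
struct toy_state {
	bool posting_cap;		/* IRQ posting supported */
	bool apicv_active;		/* APICv active on this vcpu */
	bool has_assigned_device;	/* VM has an assigned device */
	int nv;				/* notification vector in pi_desc */
};

/* Returns 1 when the vcpu should leave the halt loop, 0 otherwise. */
static int toy_check_block(const struct toy_state *s)
{
	assert(s != NULL);

	if (!s->posting_cap || !s->apicv_active || !s->has_assigned_device)
		return 0;

	/* Already armed with the wakeup vector: safe to keep blocking. */
	if (s->nv == TOY_WAKEUP_VECTOR)
		return 0;

	return 1;
}
```

The only case that returns 1 is the race described in the commit message:
a device has been assigned while the vcpu was already blocked with the
wakeup vector not yet programmed.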