diff mbox

[v3,24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

Message ID 1418397300-10870-25-git-send-email-feng.wu@intel.com (mailing list archive)
State New, archived

Commit Message

Wu, Feng Dec. 12, 2014, 3:14 p.m. UTC
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.

pre-block:
- Add the vCPU to the blocked per-CPU list
- Clear 'SN'
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR

post-block:
- Remove the vCPU from the per-CPU list

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/vmx.c              | 96 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 22 +++++++---
 include/linux/kvm_host.h        |  4 ++
 virt/kvm/kvm_main.c             |  6 +++
 5 files changed, 123 insertions(+), 7 deletions(-)

Comments

Paolo Bonzini Dec. 17, 2014, 5:09 p.m. UTC | #1
On 12/12/2014 16:14, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
> 
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Clear 'SN'

Should SN be already clear (and NV set to POSTED_INTR_VECTOR)?  Can it
happen that you go from sched-out to blocked without doing a sched-in first?

In fact, if this is possible, what happens if vcpu->preempted &&
vcpu->blocked?

> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> 
> post-block:
> - Remove the vCPU from the per-CPU list

Paolo

> Signed-off-by: Feng Wu <feng.wu@intel.com>
Wu, Feng Dec. 18, 2014, 3:16 a.m. UTC | #2
> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 1:10 AM
> To: Wu, Feng; Thomas Gleixner; Ingo Molnar; H. Peter Anvin; x86@kernel.org;
> Gleb Natapov; Paolo Bonzini; dwmw2@infradead.org; joro@8bytes.org; Alex
> Williamson
> Cc: iommu@lists.linux-foundation.org; linux-kernel@vger.kernel.org; KVM list;
> Eric Auger
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> 
> 
> On 12/12/2014 16:14, Feng Wu wrote:
> > This patch updates the Posted-Interrupts Descriptor when vCPU
> > is blocked.
> >
> > pre-block:
> > - Add the vCPU to the blocked per-CPU list
> > - Clear 'SN'
> 
> Should SN be already clear (and NV set to POSTED_INTR_VECTOR)? 

I think the SN bit should already be clear here; clearing it here is just to
make sure SN is clear when the vCPU is blocked, so it can receive the wakeup
notification event later.
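
For reference, here is a sketch of the posted-interrupts descriptor layout
this relies on (bit positions are from the VT-d spec; the C representation
below is only an illustration, not necessarily the exact struct in the patch):

struct pi_desc {
	u32 pir[8];	/* bits 255:0 - Posted-Interrupt Requests */
	union {
		struct {
			u16	on : 1,		/* bit 256 - Outstanding Notification */
				sn : 1,		/* bit 257 - Suppress Notification */
				rsvd_1 : 14;	/* bits 271:258 - reserved */
			u8	nv;		/* bits 279:272 - Notification Vector */
			u8	rsvd_2;		/* bits 287:280 - reserved */
			u32	ndst;		/* bits 319:288 - Notification Destination */
		};
		u64	control;
	};
	u32 rsvd[6];	/* bits 511:320 - reserved */
} __aligned(64);

While SN is set, the hardware still posts interrupts into PIR but suppresses
the notification event, which is exactly why SN must be clear before blocking.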

> Can it
> happen that you go from sched-out to blocked without doing a sched-in first?
> 

I cannot imagine this scenario, can you please be more specific? Thanks a lot!

> In fact, if this is possible, what happens if vcpu->preempted &&
> vcpu->blocked?

In fact, vcpu->preempted && vcpu->blocked happens sometimes, but I think there
are no issues. Please refer to the following case:

kvm_vcpu_block()
	-> vcpu->blocked = true;
	-> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
	-> schedule();

	Before schedule() is called, this vCPU is woken up by another thread,
	so the state of the thread associated with the vCPU is changed to
	TASK_RUNNING. Preemption can then happen after an interrupt, or when
	the following schedule() is hit; this calls kvm_sched_out(), in which
	current->state == TASK_RUNNING and vcpu->preempted is set to true. So
	now vcpu->preempted and vcpu->blocked are both true. In vmx_vcpu_put(),
	we check vcpu->preempted first, so the vCPU will not take the blocked
	path, and vcpu->blocked will be set to false in vmx_vcpu_load().

	But maybe I need to make a small change to vmx_vcpu_load(), like below:

                /*
                 * Delete the vCPU from the related wakeup queue
                 * if we are resuming from blocked state
                 */
                if (vcpu->blocked) {
                        vcpu->blocked = false;
+                       /*
+                        * If wakeup_cpu == -1, the vCPU is currently not
+                        * blocked on any pCPU; no need to dequeue here.
+                        */
+                       if (vcpu->wakeup_cpu != -1) {
                                spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
                                        vcpu->wakeup_cpu), flags);
                                list_del(&vcpu->blocked_vcpu_list);
                                spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
                                        vcpu->wakeup_cpu), flags);
                                vcpu->wakeup_cpu = -1;
+                       }
                }

Any ideas about this? Thanks a lot!

Thanks,
Feng




> 
> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >
> > post-block:
> > - Remove the vCPU from the per-CPU list
> 
> Paolo
> 
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
Paolo Bonzini Dec. 18, 2014, 8:37 a.m. UTC | #3
On 18/12/2014 04:16, Wu, Feng wrote:
>>> pre-block:
>>> - Add the vCPU to the blocked per-CPU list
>>> - Clear 'SN'
>>
>> Should SN be already clear (and NV set to POSTED_INTR_VECTOR)? 
> 
> I think the SN bit should already be clear here; clearing it here is just to
> make sure SN is clear when the vCPU is blocked, so it can receive the wakeup
> notification event later.

Then, please, WARN if the SN bit is set inside the if (vcpu->blocked).
Inside that if you can just add the vCPU to the blocked list on vcpu_put.

>> Can it
>> happen that you go from sched-out to blocked without doing a sched-in first?
>>
> 
> I cannot imagine this scenario, can you please be more specific? Thanks a lot!

I cannot either. :)  But it would be the case where SN is not cleared.
So we agree that it cannot happen.

>> In fact, if this is possible, what happens if vcpu->preempted &&
>> vcpu->blocked?
> 
> In fact, vcpu->preempted && vcpu->blocked happens sometimes, but I think there
> are no issues. Please refer to the following case:

I agree that there should be no issues.  But if it can happen, it's better:

1) to separate the handling of preemption and blocking: preemption
handles SN/NV/NDST, blocking handles the wakeup list.

2) to change this

+		} else if (vcpu->blocked) {
+			/*
+			 * The vcpu is blocked on the wait queue.
+			 * Store the blocked vCPU on the list of the
+			 * vcpu->wakeup_cpu, which is the destination
+			 * of the wake-up notification event.

to just

		}
		if (vcpu->blocked) {
			...
		}
> kvm_vcpu_block()
> 	-> vcpu->blocked = true;
> 	-> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> 	-> schedule();
> 
> 	Before schedule() is called, this vCPU is woken up by another thread,
> 	so the state of the thread associated with the vCPU is changed to
> 	TASK_RUNNING. Preemption can then happen after an interrupt, or when
> 	the following schedule() is hit; this calls kvm_sched_out(), in which
> 	current->state == TASK_RUNNING and vcpu->preempted is set to true. So
> 	now vcpu->preempted and vcpu->blocked are both true. In vmx_vcpu_put(),
> 	we check vcpu->preempted first, so the vCPU will not take the blocked
> 	path, and vcpu->blocked will be set to false in vmx_vcpu_load().
> 
> 	But maybe I need to make a small change to vmx_vcpu_load(), like below:
> 
>                 /*
>                  * Delete the vCPU from the related wakeup queue
>                  * if we are resuming from blocked state
>                  */
>                 if (vcpu->blocked) {
>                         vcpu->blocked = false;
> +                       /*
> +                        * If wakeup_cpu == -1, the vCPU is currently not
> +                        * blocked on any pCPU; no need to dequeue here.
> +                        */
> +                       if (vcpu->wakeup_cpu != -1) {
>                                 spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
>                                         vcpu->wakeup_cpu), flags);
>                                 list_del(&vcpu->blocked_vcpu_list);
>                                 spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
>                                         vcpu->wakeup_cpu), flags);
>                                 vcpu->wakeup_cpu = -1;
> +                       }
>                 }

Good idea.

Paolo

> Any ideas about this? Thanks a lot!
> 
> Thanks,
> Feng
> 
> 
> 
> 
>>
>>> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>>>
>>> post-block:
>>> - Remove the vCPU from the per-CPU list
>>
>> Paolo
>>
>>> Signed-off-by: Feng Wu <feng.wu@intel.com>
Wu, Feng Dec. 19, 2014, 2:51 a.m. UTC | #4
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 4:37 PM
> To: linux-kernel@vger.kernel.org
> Cc: iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> 
> 
> On 18/12/2014 04:16, Wu, Feng wrote:
> >>> pre-block:
> >>> - Add the vCPU to the blocked per-CPU list
> >>> - Clear 'SN'
> >>
> >> Should SN be already clear (and NV set to POSTED_INTR_VECTOR)?
> >
> > I think the SN bit should already be clear here; clearing it here is just
> > to make sure SN is clear when the vCPU is blocked, so it can receive the
> > wakeup notification event later.
> 
> Then, please, WARN if the SN bit is set inside the if (vcpu->blocked).
> Inside that if you can just add the vCPU to the blocked list on vcpu_put.
> 
> >> Can it
> >> happen that you go from sched-out to blocked without doing a sched-in
> first?
> >>
> >
> > I cannot imagine this scenario, can you please be more specific? Thanks a lot!
> 
> I cannot either. :)  But it would be the case where SN is not cleared.
> So we agree that it cannot happen.
> 
> >> In fact, if this is possible, what happens if vcpu->preempted &&
> >> vcpu->blocked?
> >
> > In fact, vcpu->preempted && vcpu->blocked happens sometimes, but I think
> > there are no issues. Please refer to the following case:
> 
> I agree that there should be no issues.  But if it can happen, it's better:
> 
> 1) to separate the handling of preemption and blocking: preemption
> handles SN/NV/NDST, blocking handles the wakeup list.
> 
Sorry, I don't quite understand this.

I think the handling of preemption and blocking is already separated in
vmx_vcpu_put(). In vmx_vcpu_load(), the handling of SN/NV/NDST is common
to preemption and blocking.

Thanks,
Feng

> 2) to change this
> 
> +		} else if (vcpu->blocked) {
> +			/*
> +			 * The vcpu is blocked on the wait queue.
> +			 * Store the blocked vCPU on the list of the
> +			 * vcpu->wakeup_cpu, which is the destination
> +			 * of the wake-up notification event.
> 
> to just
> 
> 		}
> 		if (vcpu->blocked) {
> 			...
> 		}
> > kvm_vcpu_block()
> > 	-> vcpu->blocked = true;
> > 	-> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> > 	-> schedule();
> >
> > 	Before schedule() is called, this vCPU is woken up by another thread,
> > 	so the state of the thread associated with the vCPU is changed to
> > 	TASK_RUNNING. Preemption can then happen after an interrupt, or when
> > 	the following schedule() is hit; this calls kvm_sched_out(), in which
> > 	current->state == TASK_RUNNING and vcpu->preempted is set to true. So
> > 	now vcpu->preempted and vcpu->blocked are both true. In vmx_vcpu_put(),
> > 	we check vcpu->preempted first, so the vCPU will not take the blocked
> > 	path, and vcpu->blocked will be set to false in vmx_vcpu_load().
> >
> > 	But maybe I need to make a small change to vmx_vcpu_load(), like below:
> >
> >                 /*
> >                  * Delete the vCPU from the related wakeup queue
> >                  * if we are resuming from blocked state
> >                  */
> >                 if (vcpu->blocked) {
> >                         vcpu->blocked = false;
> > +                       /*
> > +                        * If wakeup_cpu == -1, the vCPU is currently not
> > +                        * blocked on any pCPU; no need to dequeue here.
> > +                        */
> > +                       if (vcpu->wakeup_cpu != -1) {
> >                                 spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> >                                         vcpu->wakeup_cpu), flags);
> >                                 list_del(&vcpu->blocked_vcpu_list);
> >                                 spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> >                                         vcpu->wakeup_cpu), flags);
> >                                 vcpu->wakeup_cpu = -1;
> > +                       }
> >                 }
> 
> Good idea.
> 
> Paolo
> 
> > Any ideas about this? Thanks a lot!
> >
> > Thanks,
> > Feng
> >
> >
> >
> >
> >>
> >>> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >>>
> >>> post-block:
> >>> - Remove the vCPU from the per-CPU list
> >>
> >> Paolo
> >>
> >>> Signed-off-by: Feng Wu <feng.wu@intel.com>
Marcelo Tosatti Feb. 25, 2015, 9:50 p.m. UTC | #5
On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
> 
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Clear 'SN'
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> 
> post-block:
> - Remove the vCPU from the per-CPU list
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---

I don't see why this is needed; you can use the existing POSTED_INTR_VECTOR:

If in guest mode, the IPI will be handled in VMX non-root mode by performing
the PIR->IRR transfer.

If outside guest mode, the POSTED_INTR_VECTOR IPI will be handled by the host,
which can wake up the guest (in case it is halted).
Wu, Feng Feb. 26, 2015, 8:08 a.m. UTC | #6
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Thursday, February 26, 2015 5:50 AM
> To: Wu, Feng
> Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > This patch updates the Posted-Interrupts Descriptor when vCPU
> > is blocked.
> >
> > pre-block:
> > - Add the vCPU to the blocked per-CPU list
> > - Clear 'SN'
> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >
> > post-block:
> > - Remove the vCPU from the per-CPU list
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> 
> I don't see why this is needed; you can use the existing POSTED_INTR_VECTOR:
> 
> If in guest mode, the IPI will be handled in VMX non-root mode by performing
> the PIR->IRR transfer.
> 
> If outside guest mode, the POSTED_INTR_VECTOR IPI will be handled by the host,
> which can wake up the guest (in case it is halted).

Please see the following scenario:

1. vCPU0 is running on pCPU0
2. vCPU0 is halted and vCPU1 is currently running on pCPU0
3. An interrupt occurs for vCPU0. If we still use POSTED_INTR_VECTOR
for vCPU0, the notification event for vCPU0 (the event will go to pCPU0)
will be consumed by vCPU1 incorrectly. The worst case is that vCPU0
will never be woken up again, since the wakeup event for it is always
consumed by other vCPUs incorrectly.

Thanks,
Feng
Marcelo Tosatti Feb. 26, 2015, 11:40 p.m. UTC | #7
On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
> 
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Clear 'SN'
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> 
> post-block:
> - Remove the vCPU from the per-CPU list
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +
>  arch/x86/kvm/vmx.c              | 96 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              | 22 +++++++---
>  include/linux/kvm_host.h        |  4 ++
>  virt/kvm/kvm_main.c             |  6 +++
>  5 files changed, 123 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 13e3e40..32c110a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  
>  #define ASYNC_PF_PER_VCPU 64
>  
> +extern void (*wakeup_handler_callback)(void);
> +
>  enum kvm_reg {
>  	VCPU_REGS_RAX = 0,
>  	VCPU_REGS_RCX = 1,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bf2e6cd..a1c83a2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
>  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
>  
> +/*
> + * We maintain a per-CPU linked list of vCPUs, so in wakeup_handler() we
> + * can find which vCPU should be woken up.
> + */
> +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> +
>  static unsigned long *vmx_io_bitmap_a;
>  static unsigned long *vmx_io_bitmap_b;
>  static unsigned long *vmx_msr_bitmap_legacy;
> @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  		struct pi_desc old, new;
>  		unsigned int dest;
> +		unsigned long flags;
>  
>  		memset(&old, 0, sizeof(old));
>  		memset(&new, 0, sizeof(new));
> @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  			new.nv = POSTED_INTR_VECTOR;
>  		} while (cmpxchg(&pi_desc->control, old.control,
>  				new.control) != old.control);
> +
> +		/*
> +		 * Delete the vCPU from the related wakeup queue
> +		 * if we are resuming from blocked state
> +		 */
> +		if (vcpu->blocked) {
> +			vcpu->blocked = false;
> +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> +				vcpu->wakeup_cpu), flags);
> +			list_del(&vcpu->blocked_vcpu_list);
> +			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> +				vcpu->wakeup_cpu), flags);
> +			vcpu->wakeup_cpu = -1;
> +		}
>  	}
>  }
>  
> @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
>  	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
>  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  		struct pi_desc old, new;
> +		unsigned long flags;
> +		int cpu;
> +		struct cpumask cpu_others_mask;
>  
>  		memset(&old, 0, sizeof(old));
>  		memset(&new, 0, sizeof(new));
> @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
>  				pi_set_sn(&new);
>  			} while (cmpxchg(&pi_desc->control, old.control,
>  					new.control) != old.control);
> +		} else if (vcpu->blocked) {
> +			/*
> +			 * The vcpu is blocked on the wait queue.
> +			 * Store the blocked vCPU on the list of the
> +			 * vcpu->wakeup_cpu, which is the destination
> +			 * of the wake-up notification event.
> +			 */
> +			vcpu->wakeup_cpu = vcpu->cpu;
> +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> +					  vcpu->wakeup_cpu), flags);
> +			list_add_tail(&vcpu->blocked_vcpu_list,
> +				      &per_cpu(blocked_vcpu_on_cpu,
> +				      vcpu->wakeup_cpu));
> +			spin_unlock_irqrestore(
> +					&per_cpu(blocked_vcpu_on_cpu_lock,
> +					vcpu->wakeup_cpu), flags);
> +
> +			do {
> +				old.control = new.control = pi_desc->control;
> +
> +				/*
> +				 * We should not block the vCPU if
> +				 * an interrupt is posted for it.
> +				 */
> +				if (pi_test_on(pi_desc) == 1) {
> +					/*
> +					 * We need schedule the wakeup worker
> +					 * on a different cpu other than
> +					 * vcpu->cpu, because in some case,
> +					 * schedule_work() will call
> +					 * try_to_wake_up() which needs acquire
> +					 * the rq lock. This can cause deadlock.
> +					 */
> +					cpumask_copy(&cpu_others_mask,
> +						     cpu_online_mask);
> +					cpu_clear(vcpu->cpu, cpu_others_mask);
> +					cpu = any_online_cpu(cpu_others_mask);
> +
> +					schedule_work_on(cpu,
> +							 &vcpu->wakeup_worker);
> +				}
> +
> +				pi_clear_sn(&new);
> +
> +				/* set 'NV' to 'wakeup vector' */
> +				new.nv = POSTED_INTR_WAKEUP_VECTOR;
> +			} while (cmpxchg(&pi_desc->control, old.control,
> +				new.control) != old.control);
>  		}

This can be done exclusively on HLT emulation, correct? (that is, on
entry to HLT and exit from HLT).

If the vcpu is scheduled out for any other reason (transition to
userspace or transition to other thread), it will eventually resume
execution. And in that case, continuation of execution does not depend
on the event (VT-d interrupt) notification.

There is a race window with the code above, I believe.

>  	}
>  
> @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
>  		return -EBUSY;
>  
>  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> +	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> +	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
>  
>  	/*
>  	 * Now we can enable the vmclear operation in kdump
> @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.pi_set_sn = vmx_pi_set_sn,
>  };
>  
> +/*
> + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> + */
> +void wakeup_handler(void)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int cpu = smp_processor_id();
> +
> +	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> +	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> +			blocked_vcpu_list) {
> +		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> +
> +		if (pi_test_on(pi_desc) == 1)
> +			kvm_vcpu_kick(vcpu);
> +	}
> +	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> +}

Looping through all blocked vcpus does not scale:
Can you allocate more vectors and then multiplex those
vectors amongst the HLT'ed vcpus? 

It seems there is a bunch free:

commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
Author: Alex Shi <alex.shi@intel.com>
Date:   Thu Jun 28 09:02:23 2012 +0800

    x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR

Can you add only vcpus which have posted IRTEs that point to this pCPU
to the HLT'ed vcpu lists? (so for example, vcpus without assigned
devices are not part of the list).

> +
>  static int __init vmx_init(void)
>  {
>  	int r, i, msr;
> @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
>  
>  	update_ple_window_actual_max();
>  
> +	wakeup_handler_callback = wakeup_handler;
> +
>  	return 0;
>  
>  out7:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0033df3..1551a46 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			kvm_vcpu_reload_apic_access_page(vcpu);
>  	}
>  
> +	/*
> +	 * Since posted-interrupts can be set by VT-d HW now, in this
> +	 * case, KVM_REQ_EVENT is not set. We move the following
> +	 * operations out of the if statement.
> +	 */
> +	if (kvm_lapic_enabled(vcpu)) {
> +		/*
> +		 * Update architecture specific hints for APIC
> +		 * virtual interrupt delivery.
> +		 */
> +		if (kvm_x86_ops->hwapic_irr_update)
> +			kvm_x86_ops->hwapic_irr_update(vcpu,
> +				kvm_lapic_find_highest_irr(vcpu));
> +	}
> +

This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.
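
Something along these lines (a sketch only; kvm_make_request() is the
existing request API, the rest reuses the wakeup_handler from this patch):

	if (pi_test_on(pi_desc) == 1) {
		/*
		 * Ask vcpu_enter_guest() to sync PIR->IRR inside its
		 * KVM_REQ_EVENT section, so hwapic_irr_update() can stay
		 * off the unconditional fast path.
		 */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_vcpu_kick(vcpu);
	}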

>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>  		kvm_apic_accept_events(vcpu);
>  		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> @@ -6168,13 +6183,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			kvm_x86_ops->enable_irq_window(vcpu);
>  
>  		if (kvm_lapic_enabled(vcpu)) {
> -			/*
> -			 * Update architecture specific hints for APIC
> -			 * virtual interrupt delivery.
> -			 */
> -			if (kvm_x86_ops->hwapic_irr_update)
> -				kvm_x86_ops->hwapic_irr_update(vcpu,
> -					kvm_lapic_find_highest_irr(vcpu));
>  			update_cr8_intercept(vcpu);
>  			kvm_lapic_sync_to_vapic(vcpu);
>  		}
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d7242c..d981d16 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -239,6 +239,9 @@ struct kvm_vcpu {
>  	unsigned long requests;
>  	unsigned long guest_debug;
>  
> +	int wakeup_cpu;
> +	struct list_head blocked_vcpu_list;
> +
>  	struct mutex mutex;
>  	struct kvm_run *run;
>  
> @@ -282,6 +285,7 @@ struct kvm_vcpu {
>  	} spin_loop;
>  #endif
>  	bool preempted;
> +	bool blocked;
>  	struct kvm_vcpu_arch arch;
>  };

Please remove blocked and wakeup_cpu, they should not be necessary.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ba53fd6..6deb994 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -233,6 +233,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  
>  	INIT_WORK(&vcpu->wakeup_worker, wakeup_thread);
>  
> +	vcpu->wakeup_cpu = -1;
> +	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
> +
>  	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>  	if (!page) {
>  		r = -ENOMEM;
> @@ -243,6 +246,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	kvm_vcpu_set_in_spin_loop(vcpu, false);
>  	kvm_vcpu_set_dy_eligible(vcpu, false);
>  	vcpu->preempted = false;
> +	vcpu->blocked = false;
>  
>  	r = kvm_arch_vcpu_init(vcpu);
>  	if (r < 0)
> @@ -1752,6 +1756,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	DEFINE_WAIT(wait);
>  
>  	for (;;) {
> +		vcpu->blocked = true;
>  		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>  
>  		if (kvm_arch_vcpu_runnable(vcpu)) {
> @@ -1767,6 +1772,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	}
>  
>  	finish_wait(&vcpu->wq, &wait);
> +	vcpu->blocked = false;
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_block);
>  
> -- 
> 1.9.1
> 
Marcelo Tosatti Feb. 26, 2015, 11:41 p.m. UTC | #8
On Thu, Feb 26, 2015 at 08:08:15AM +0000, Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > Sent: Thursday, February 26, 2015 5:50 AM
> > To: Wu, Feng
> > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> > 
> > On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > > This patch updates the Posted-Interrupts Descriptor when vCPU
> > > is blocked.
> > >
> > > pre-block:
> > > - Add the vCPU to the blocked per-CPU list
> > > - Clear 'SN'
> > > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > >
> > > post-block:
> > > - Remove the vCPU from the per-CPU list
> > >
> > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > ---
> > 
> > I don't see why this is needed; you can use the existing POSTED_INTR_VECTOR:
> > 
> > If in guest mode, the IPI will be handled in VMX non-root mode by performing
> > the PIR->IRR transfer.
> > 
> > If outside guest mode, the POSTED_INTR_VECTOR IPI will be handled by the host,
> > which can wake up the guest (in case it is halted).
> 
> Please see the following scenario:
> 
> 1. vCPU0 is running on pCPU0
> 2. vCPU0 is halted and vCPU1 is currently running on pCPU0
> 3. An interrupt occurs for vCPU0. If we still use POSTED_INTR_VECTOR
> for vCPU0, the notification event for vCPU0 (the event will go to pCPU0)
> will be consumed by vCPU1 incorrectly. The worst case is that vCPU0
> will never be woken up again, since the wakeup event for it is always
> consumed by other vCPUs incorrectly.
> 
> Thanks,
> Feng

Ouch, yes.

Wu, Feng March 2, 2015, 1:36 p.m. UTC | #9
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Friday, February 27, 2015 7:41 AM
> To: Wu, Feng
> Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > This patch updates the Posted-Interrupts Descriptor when vCPU
> > is blocked.
> >
> > pre-block:
> > - Add the vCPU to the blocked per-CPU list
> > - Clear 'SN'
> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >
> > post-block:
> > - Remove the vCPU from the per-CPU list
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 +
> >  arch/x86/kvm/vmx.c              | 96
> +++++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/x86.c              | 22 +++++++---
> >  include/linux/kvm_host.h        |  4 ++
> >  virt/kvm/kvm_main.c             |  6 +++
> >  5 files changed, 123 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> > index 13e3e40..32c110a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t
> base_gfn, int level)
> >
> >  #define ASYNC_PF_PER_VCPU 64
> >
> > +extern void (*wakeup_handler_callback)(void);
> > +
> >  enum kvm_reg {
> >  	VCPU_REGS_RAX = 0,
> >  	VCPU_REGS_RCX = 1,
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index bf2e6cd..a1c83a2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *,
> current_vmcs);
> >  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> >  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> >
> > +/*
> > + * We maintain a per-CPU linked list of vCPUs, so in wakeup_handler() we
> > + * can find which vCPU should be woken up.
> > + */
> > +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> > +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> > +
> >  static unsigned long *vmx_io_bitmap_a;
> >  static unsigned long *vmx_io_bitmap_b;
> >  static unsigned long *vmx_msr_bitmap_legacy;
> > @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu,
> int cpu)
> >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> >  		struct pi_desc old, new;
> >  		unsigned int dest;
> > +		unsigned long flags;
> >
> >  		memset(&old, 0, sizeof(old));
> >  		memset(&new, 0, sizeof(new));
> > @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu
> *vcpu, int cpu)
> >  			new.nv = POSTED_INTR_VECTOR;
> >  		} while (cmpxchg(&pi_desc->control, old.control,
> >  				new.control) != old.control);
> > +
> > +		/*
> > +		 * Delete the vCPU from the related wakeup queue
> > +		 * if we are resuming from blocked state
> > +		 */
> > +		if (vcpu->blocked) {
> > +			vcpu->blocked = false;
> > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > +				vcpu->wakeup_cpu), flags);
> > +			list_del(&vcpu->blocked_vcpu_list);
> > +			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> > +				vcpu->wakeup_cpu), flags);
> > +			vcpu->wakeup_cpu = -1;
> > +		}
> >  	}
> >  }
> >
> > @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> >  	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> >  		struct pi_desc old, new;
> > +		unsigned long flags;
> > +		int cpu;
> > +		struct cpumask cpu_others_mask;
> >
> >  		memset(&old, 0, sizeof(old));
> >  		memset(&new, 0, sizeof(new));
> > @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu
> *vcpu)
> >  				pi_set_sn(&new);
> >  			} while (cmpxchg(&pi_desc->control, old.control,
> >  					new.control) != old.control);
> > +		} else if (vcpu->blocked) {
> > +			/*
> > +			 * The vcpu is blocked on the wait queue.
> > +			 * Store the blocked vCPU on the list of the
> > +			 * vcpu->wakeup_cpu, which is the destination
> > +			 * of the wake-up notification event.
> > +			 */
> > +			vcpu->wakeup_cpu = vcpu->cpu;
> > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > +					  vcpu->wakeup_cpu), flags);
> > +			list_add_tail(&vcpu->blocked_vcpu_list,
> > +				      &per_cpu(blocked_vcpu_on_cpu,
> > +				      vcpu->wakeup_cpu));
> > +			spin_unlock_irqrestore(
> > +					&per_cpu(blocked_vcpu_on_cpu_lock,
> > +					vcpu->wakeup_cpu), flags);
> > +
> > +			do {
> > +				old.control = new.control = pi_desc->control;
> > +
> > +				/*
> > +				 * We should not block the vCPU if
> > +				 * an interrupt is posted for it.
> > +				 */
> > +				if (pi_test_on(pi_desc) == 1) {
> > +					/*
> > +					 * We need schedule the wakeup worker
> > +					 * on a different cpu other than
> > +					 * vcpu->cpu, because in some case,
> > +					 * schedule_work() will call
> > +					 * try_to_wake_up() which needs acquire
> > +					 * the rq lock. This can cause deadlock.
> > +					 */
> > +					cpumask_copy(&cpu_others_mask,
> > +						     cpu_online_mask);
> > +					cpu_clear(vcpu->cpu, cpu_others_mask);
> > +					cpu = any_online_cpu(cpu_others_mask);
> > +
> > +					schedule_work_on(cpu,
> > +							 &vcpu->wakeup_worker);
> > +				}
> > +
> > +				pi_clear_sn(&new);
> > +
> > +				/* set 'NV' to 'wakeup vector' */
> > +				new.nv = POSTED_INTR_WAKEUP_VECTOR;
> > +			} while (cmpxchg(&pi_desc->control, old.control,
> > +				new.control) != old.control);
> >  		}
> 
> This can be done exclusively on HLT emulation, correct? (that is, on
> entry to HLT and exit from HLT).

Do you mean the following?
In kvm_emulate_halt(), we do:
1. Add the vCPU to the blocking list
2. Clear 'SN'
3. Set 'NV' to POSTED_INTR_WAKEUP_VECTOR

In __vcpu_run(), after kvm_vcpu_block(), we remove the vCPU from the
blocking list.
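
A rough sketch of that alternative (hypothetical hook names, not part of this
patch; the hooks would have to be wired through kvm_x86_ops since
kvm_emulate_halt() lives in x86.c):

int kvm_emulate_halt(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.halt_exits;
	if (kvm_x86_ops->pre_block)
		kvm_x86_ops->pre_block(vcpu);	/* steps 1-3 above */
	if (irqchip_in_kernel(vcpu->kvm)) {
		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
		return 1;
	}
	vcpu->run->exit_reason = KVM_EXIT_HLT;
	return 0;
}

/* ... and in __vcpu_run(), after kvm_vcpu_block() returns: */
if (kvm_x86_ops->post_block)
	kvm_x86_ops->post_block(vcpu);	/* dequeue, NV back to POSTED_INTR_VECTOR */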

> 
> If the vcpu is scheduled out for any other reason (transition to
> userspace or transition to other thread), it will eventually resume
> execution. And in that case, continuation of execution does not depend
> on the event (VT-d interrupt) notification.

Yes, I think this is true for my current implementation, right?

> 
> There is a race window with the code above, I believe.

I did a careful code review back and forth on the above code; it would be
highly appreciated if you could point out the race window!

> 
> >  	}
> >
> > @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
> >  		return -EBUSY;
> >
> >  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > +	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > +	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> >
> >  	/*
> >  	 * Now we can enable the vmclear operation in kdump
> > @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
> >  	.pi_set_sn = vmx_pi_set_sn,
> >  };
> >
> > +/*
> > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> > + */
> > +void wakeup_handler(void)
> > +{
> > +	struct kvm_vcpu *vcpu;
> > +	int cpu = smp_processor_id();
> > +
> > +	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > +	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> > +			blocked_vcpu_list) {
> > +		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > +
> > +		if (pi_test_on(pi_desc) == 1)
> > +			kvm_vcpu_kick(vcpu);
> > +	}
> > +	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > +}
> 
> Looping through all blocked vcpus does not scale:
> Can you allocate more vectors and then multiplex those
> vectors amongst the HLT'ed vcpus?

I am a little confused about this; can you elaborate a bit more?
Thanks a lot!

> 
> It seems there is a bunch free:
> 
> commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> Author: Alex Shi <alex.shi@intel.com>
> Date:   Thu Jun 28 09:02:23 2012 +0800
> 
>     x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
> 
> Can you add only vcpus which have posted IRTEs that point to this pCPU
> to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> devices are not part of the list).

Is it easy to find out whether a vCPU (or the associated domain) has assigned
devices? If so, we can add only those vCPUs with assigned devices.

> 
> > +
> >  static int __init vmx_init(void)
> >  {
> >  	int r, i, msr;
> > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> >
> >  	update_ple_window_actual_max();
> >
> > +	wakeup_handler_callback = wakeup_handler;
> > +
> >  	return 0;
> >
> >  out7:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 0033df3..1551a46 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
> >  			kvm_vcpu_reload_apic_access_page(vcpu);
> >  	}
> >
> > +	/*
> > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > +	 * operations out of the if statement.
> > +	 */
> > +	if (kvm_lapic_enabled(vcpu)) {
> > +		/*
> > +		 * Update architecture specific hints for APIC
> > +		 * virtual interrupt delivery.
> > +		 */
> > +		if (kvm_x86_ops->hwapic_irr_update)
> > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > +				kvm_lapic_find_highest_irr(vcpu));
> > +	}
> > +
> 
> This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.

I am afraid setting KVM_REQ_EVENT from wakeup_handler doesn't help much:
if the vCPU is running in root mode and the VT-d hardware issues a
notification event, the POSTED_INTR_VECTOR interrupt handler will be called.

Again, the POSTED_INTR_VECTOR interrupt handler may be called very frequently,
and it is hard to get the vCPU-related information in it; even if we can get
it (which requires a search), it is not accurate and may harm performance.

> 
> >  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> >  		kvm_apic_accept_events(vcpu);
> >  		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> > @@ -6168,13 +6183,6 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
> >  			kvm_x86_ops->enable_irq_window(vcpu);
> >
> >  		if (kvm_lapic_enabled(vcpu)) {
> > -			/*
> > -			 * Update architecture specific hints for APIC
> > -			 * virtual interrupt delivery.
> > -			 */
> > -			if (kvm_x86_ops->hwapic_irr_update)
> > -				kvm_x86_ops->hwapic_irr_update(vcpu,
> > -					kvm_lapic_find_highest_irr(vcpu));
> >  			update_cr8_intercept(vcpu);
> >  			kvm_lapic_sync_to_vapic(vcpu);
> >  		}
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d7242c..d981d16 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -239,6 +239,9 @@ struct kvm_vcpu {
> >  	unsigned long requests;
> >  	unsigned long guest_debug;
> >
> > +	int wakeup_cpu;
> > +	struct list_head blocked_vcpu_list;
> > +
> >  	struct mutex mutex;
> >  	struct kvm_run *run;
> >
> > @@ -282,6 +285,7 @@ struct kvm_vcpu {
> >  	} spin_loop;
> >  #endif
> >  	bool preempted;
> > +	bool blocked;
> >  	struct kvm_vcpu_arch arch;
> >  };
> 
> Please remove blocked and wakeup_cpu, they should not be necessary.

Why do you think wakeup_cpu is not needed? When a vCPU is blocked,
wakeup_cpu saves the CPU on which the vCPU is blocked; after the vCPU is
woken up, it can run on a different CPU, so we need wakeup_cpu to find the
right per-CPU list from which to remove the vCPU.

Thanks,
Feng

> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ba53fd6..6deb994 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -233,6 +233,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm
> *kvm, unsigned id)
> >
> >  	INIT_WORK(&vcpu->wakeup_worker, wakeup_thread);
> >
> > +	vcpu->wakeup_cpu = -1;
> > +	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
> > +
> >  	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >  	if (!page) {
> >  		r = -ENOMEM;
> > @@ -243,6 +246,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm
> *kvm, unsigned id)
> >  	kvm_vcpu_set_in_spin_loop(vcpu, false);
> >  	kvm_vcpu_set_dy_eligible(vcpu, false);
> >  	vcpu->preempted = false;
> > +	vcpu->blocked = false;
> >
> >  	r = kvm_arch_vcpu_init(vcpu);
> >  	if (r < 0)
> > @@ -1752,6 +1756,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  	DEFINE_WAIT(wait);
> >
> >  	for (;;) {
> > +		vcpu->blocked = true;
> >  		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> >
> >  		if (kvm_arch_vcpu_runnable(vcpu)) {
> > @@ -1767,6 +1772,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  	}
> >
> >  	finish_wait(&vcpu->wq, &wait);
> > +	vcpu->blocked = false;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_vcpu_block);
> >
> > --
> > 1.9.1
> >
Marcelo Tosatti March 4, 2015, 12:06 p.m. UTC | #10
On Mon, Mar 02, 2015 at 01:36:51PM +0000, Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > Sent: Friday, February 27, 2015 7:41 AM
> > To: Wu, Feng
> > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> > 
> > On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > > This patch updates the Posted-Interrupts Descriptor when vCPU
> > > is blocked.
> > >
> > > pre-block:
> > > - Add the vCPU to the blocked per-CPU list
> > > - Clear 'SN'
> > > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > >
> > > post-block:
> > > - Remove the vCPU from the per-CPU list
> > >
> > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  2 +
> > >  arch/x86/kvm/vmx.c              | 96
> > +++++++++++++++++++++++++++++++++++++++++
> > >  arch/x86/kvm/x86.c              | 22 +++++++---
> > >  include/linux/kvm_host.h        |  4 ++
> > >  virt/kvm/kvm_main.c             |  6 +++
> > >  5 files changed, 123 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h
> > > index 13e3e40..32c110a 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t
> > base_gfn, int level)
> > >
> > >  #define ASYNC_PF_PER_VCPU 64
> > >
> > > +extern void (*wakeup_handler_callback)(void);
> > > +
> > >  enum kvm_reg {
> > >  	VCPU_REGS_RAX = 0,
> > >  	VCPU_REGS_RCX = 1,
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index bf2e6cd..a1c83a2 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *,
> > current_vmcs);
> > >  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> > >  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> > >
> > > +/*
> > > + * We maintain a per-CPU linked list of vCPUs, so in wakeup_handler() we
> > > + * can find which vCPU should be woken up.
> > > + */
> > > +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> > > +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> > > +
> > >  static unsigned long *vmx_io_bitmap_a;
> > >  static unsigned long *vmx_io_bitmap_b;
> > >  static unsigned long *vmx_msr_bitmap_legacy;
> > > @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu,
> > int cpu)
> > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > >  		struct pi_desc old, new;
> > >  		unsigned int dest;
> > > +		unsigned long flags;
> > >
> > >  		memset(&old, 0, sizeof(old));
> > >  		memset(&new, 0, sizeof(new));
> > > @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu
> > *vcpu, int cpu)
> > >  			new.nv = POSTED_INTR_VECTOR;
> > >  		} while (cmpxchg(&pi_desc->control, old.control,
> > >  				new.control) != old.control);
> > > +
> > > +		/*
> > > +		 * Delete the vCPU from the related wakeup queue
> > > +		 * if we are resuming from blocked state
> > > +		 */
> > > +		if (vcpu->blocked) {
> > > +			vcpu->blocked = false;
> > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > +				vcpu->wakeup_cpu), flags);
> > > +			list_del(&vcpu->blocked_vcpu_list);
> > > +			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > +				vcpu->wakeup_cpu), flags);
> > > +			vcpu->wakeup_cpu = -1;
> > > +		}
> > >  	}
> > >  }
> > >
> > > @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > >  	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > >  		struct pi_desc old, new;
> > > +		unsigned long flags;
> > > +		int cpu;
> > > +		struct cpumask cpu_others_mask;
> > >
> > >  		memset(&old, 0, sizeof(old));
> > >  		memset(&new, 0, sizeof(new));
> > > @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu
> > *vcpu)
> > >  				pi_set_sn(&new);
> > >  			} while (cmpxchg(&pi_desc->control, old.control,
> > >  					new.control) != old.control);
> > > +		} else if (vcpu->blocked) {
> > > +			/*
> > > +			 * The vcpu is blocked on the wait queue.
> > > +			 * Store the blocked vCPU on the list of the
> > > +			 * vcpu->wakeup_cpu, which is the destination
> > > +			 * of the wake-up notification event.
> > > +			 */
> > > +			vcpu->wakeup_cpu = vcpu->cpu;
> > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > +					  vcpu->wakeup_cpu), flags);
> > > +			list_add_tail(&vcpu->blocked_vcpu_list,
> > > +				      &per_cpu(blocked_vcpu_on_cpu,
> > > +				      vcpu->wakeup_cpu));
> > > +			spin_unlock_irqrestore(
> > > +					&per_cpu(blocked_vcpu_on_cpu_lock,
> > > +					vcpu->wakeup_cpu), flags);
> > > +
> > > +			do {
> > > +				old.control = new.control = pi_desc->control;
> > > +
> > > +				/*
> > > +				 * We should not block the vCPU if
> > > +				 * an interrupt is posted for it.
> > > +				 */
> > > +				if (pi_test_on(pi_desc) == 1) {
> > > +					/*
> > > +					 * We need schedule the wakeup worker
> > > +					 * on a different cpu other than
> > > +					 * vcpu->cpu, because in some case,
> > > +					 * schedule_work() will call
> > > +					 * try_to_wake_up() which needs acquire
> > > +					 * the rq lock. This can cause deadlock.
> > > +					 */
> > > +					cpumask_copy(&cpu_others_mask,
> > > +						     cpu_online_mask);
> > > +					cpu_clear(vcpu->cpu, cpu_others_mask);
> > > +					cpu = any_online_cpu(cpu_others_mask);
> > > +
> > > +					schedule_work_on(cpu,
> > > +							 &vcpu->wakeup_worker);
> > > +				}
> > > +
> > > +				pi_clear_sn(&new);
> > > +
> > > +				/* set 'NV' to 'wakeup vector' */
> > > +				new.nv = POSTED_INTR_WAKEUP_VECTOR;
> > > +			} while (cmpxchg(&pi_desc->control, old.control,
> > > +				new.control) != old.control);
> > >  		}
> > 
> > This can be done exclusively on HLT emulation, correct? (that is, on
> > entry to HLT and exit from HLT).
> 
> Do you mean the following?
> In kvm_emulate_halt(), we do:
> 1. Add the vCPU to the blocking list
> 2. Clear 'SN'
> 3. Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> 
> In __vcpu_run(), after kvm_vcpu_block(), we remove the vCPU from the
> blocking list.

Yes (please check it's OK to do this...).

> > If the vcpu is scheduled out for any other reason (transition to
> > userspace or transition to other thread), it will eventually resume
> > execution. And in that case, continuation of execution does not depend
> > on the event (VT-d interrupt) notification.
> 
> Yes, I think this is true for my current implementation, right?
> 
> > 
> > There is a race window with the code above, I believe.
> 
> I did a careful code review back and forth on the above code; it would be
> highly appreciated if you could point out the race window!

So the remapping HW sees either POSTED_INTR_VECTOR or 
POSTED_INTR_WAKEUP_VECTOR.

You should:

1. Set POSTED_INTR_WAKEUP_VECTOR.
2. Check for PIR / ON bit, which might have been set by
POSTED_INTR_VECTOR notification.
3. Emulate HLT.
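
In rough code (a sketch of the ordering only, reusing the helpers and the
cmpxchg loop from the patch):

	/* 1. Switch NV to the wakeup vector (and clear SN) first. */
	do {
		old.control = new.control = pi_desc->control;
		pi_clear_sn(&new);
		new.nv = POSTED_INTR_WAKEUP_VECTOR;
	} while (cmpxchg(&pi_desc->control, old.control,
			 new.control) != old.control);

	/* 2. Only then check ON: a notification that raced in through
	 * POSTED_INTR_VECTOR is now guaranteed to be visible here. */
	if (pi_test_on(pi_desc))
		return;			/* interrupt pending, don't block */

	/* 3. Only now emulate HLT. */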

> > >  	} 
> > >
> > > @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
> > >  		return -EBUSY;
> > >
> > >  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > +	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > +	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > >
> > >  	/*
> > >  	 * Now we can enable the vmclear operation in kdump
> > > @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
> > >  	.pi_set_sn = vmx_pi_set_sn,
> > >  };
> > >
> > > +/*
> > > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> > > + */
> > > +void wakeup_handler(void)
> > > +{
> > > +	struct kvm_vcpu *vcpu;
> > > +	int cpu = smp_processor_id();
> > > +
> > > +	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > +	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> > > +			blocked_vcpu_list) {
> > > +		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > +
> > > +		if (pi_test_on(pi_desc) == 1)
> > > +			kvm_vcpu_kick(vcpu);
> > > +	}
> > > +	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > +}
> > 
> > Looping through all blocked vcpus does not scale:
> > Can you allocate more vectors and then multiplex those
> > vectors amongst the HLT'ed vcpus?
> 
> I am a little confused about this; can you elaborate a bit more?
> Thanks a lot!

Picture the following overcommitment scenario:

* High ratio of vCPUs/pCPUs, in the ratio 128/1 (this is exaggerated
to demonstrate the issue).
* Every VT-d interrupt is going to scan 128 entries in the list.

Moreover, the test:

		if (pi_test_on(pi_desc) == 1)
			kvm_vcpu_kick(vcpu);

can trigger for vCPUs which have not been woken up by VT-d interrupts,
but by other interrupts.

You can allocate, say, 16 vectors on the pCPU for VT-d interrupts:

POSTED_INTERRUPT_WAKEUP_VECTOR_1, POSTED_INTERRUPT_WAKEUP_VECTOR_2,
...
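
For example (a rough sketch; the vector names and the count of 16 are
assumptions, not existing code):

#define PI_WAKEUP_VECTORS	16

/* One blocked-vCPU list per wakeup vector, per pCPU. */
static DEFINE_PER_CPU(struct list_head,
		      blocked_vcpu_on_cpu[PI_WAKEUP_VECTORS]);

static u8 vcpu_wakeup_vector(struct kvm_vcpu *vcpu)
{
	/* Hash vCPUs across the wakeup vectors. */
	return POSTED_INTR_WAKEUP_VECTOR_1 +
	       vcpu->vcpu_id % PI_WAKEUP_VECTORS;
}

Then wakeup_handler() for a given vector scans only that vector's sub-list,
so each VT-d wakeup IPI walks roughly 1/16th of the blocked vCPUs.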

> > It seems there is a bunch free:
> > 
> > commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> > Author: Alex Shi <alex.shi@intel.com>
> > Date:   Thu Jun 28 09:02:23 2012 +0800
> > 
> >     x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
> > 
> > Can you add only vcpus which have posted IRTEs that point to this pCPU
> > to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> > devices are not part of the list).
> 
> Is it easy to find out whether a vCPU (or the associated domain) has assigned
> devices? If so, we can add only those vCPUs with assigned devices.

When configuring IRTE, at kvm_arch_vfio_update_pi_irte?

> > > +
> > >  static int __init vmx_init(void)
> > >  {
> > >  	int r, i, msr;
> > > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> > >
> > >  	update_ple_window_actual_max();
> > >
> > > +	wakeup_handler_callback = wakeup_handler;
> > > +
> > >  	return 0;
> > >
> > >  out7:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 0033df3..1551a46 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu
> > *vcpu)
> > >  			kvm_vcpu_reload_apic_access_page(vcpu);
> > >  	}
> > >
> > > +	/*
> > > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > > +	 * operations out of the if statement.
> > > +	 */
> > > +	if (kvm_lapic_enabled(vcpu)) {
> > > +		/*
> > > +		 * Update architecture specific hints for APIC
> > > +		 * virtual interrupt delivery.
> > > +		 */
> > > +		if (kvm_x86_ops->hwapic_irr_update)
> > > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > > +				kvm_lapic_find_highest_irr(vcpu));
> > > +	}
> > > +
> > 
> > This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.
> 
> I am afraid setting KVM_REQ_EVENT from wakeup_handler doesn't help much:
> if the vCPU is running in root mode and the VT-d hardware issues a
> notification event, the POSTED_INTR_VECTOR interrupt handler will be called.

If the vCPU is in root mode, the remapping HW will find the IRTE configured
with vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, and the host will
execute the interrupt handler wakeup_handler. Right?

The point of this comment is that you can keep the 

"if (kvm_x86_ops->hwapic_irr_update)
	kvm_x86_ops->hwapic_irr_update(vcpu,
			kvm_lapic_find_highest_irr(vcpu));
"

code inside the KVM_REQ_EVENT handling section of vcpu_run, as long as
wakeup_handler sets KVM_REQ_EVENT.

> Again, the POSTED_INTR_VECTOR interrupt handler may be called very frequently,
> and it is hard to get the vCPU-related information in it; even if we can get
> it (which requires a search), it is not accurate and may harm performance.
> 
> > 
> > >  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> > >  		kvm_apic_accept_events(vcpu);
> > >  		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> > > @@ -6168,13 +6183,6 @@ static int vcpu_enter_guest(struct kvm_vcpu
> > *vcpu)
> > >  			kvm_x86_ops->enable_irq_window(vcpu);
> > >
> > >  		if (kvm_lapic_enabled(vcpu)) {
> > > -			/*
> > > -			 * Update architecture specific hints for APIC
> > > -			 * virtual interrupt delivery.
> > > -			 */
> > > -			if (kvm_x86_ops->hwapic_irr_update)
> > > -				kvm_x86_ops->hwapic_irr_update(vcpu,
> > > -					kvm_lapic_find_highest_irr(vcpu));
> > >  			update_cr8_intercept(vcpu);
> > >  			kvm_lapic_sync_to_vapic(vcpu);
> > >  		}
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3d7242c..d981d16 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -239,6 +239,9 @@ struct kvm_vcpu {
> > >  	unsigned long requests;
> > >  	unsigned long guest_debug;
> > >
> > > +	int wakeup_cpu;
> > > +	struct list_head blocked_vcpu_list;
> > > +
> > >  	struct mutex mutex;
> > >  	struct kvm_run *run;
> > >
> > > @@ -282,6 +285,7 @@ struct kvm_vcpu {
> > >  	} spin_loop;
> > >  #endif
> > >  	bool preempted;
> > > +	bool blocked;
> > >  	struct kvm_vcpu_arch arch;
> > >  };
> > 
> > Please remove blocked and wakeup_cpu, they should not be necessary.
> 
> Why do you think wakeup_cpu is not needed? When a vCPU is blocked,
> wakeup_cpu saves the CPU on which the vCPU is blocked; after the vCPU is
> woken up, it can run on a different CPU, so we need wakeup_cpu to find the
> right per-CPU list from which to remove the vCPU.

If the vCPU was moved, shouldn't it have updated the IRTE destination field
to the pCPU to which it moved?

Wu, Feng March 6, 2015, 6:51 a.m. UTC | #11
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Wednesday, March 04, 2015 8:06 PM
> To: Wu, Feng
> Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Mon, Mar 02, 2015 at 01:36:51PM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > Sent: Friday, February 27, 2015 7:41 AM
> > > To: Wu, Feng
> > > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com;
> x86@kernel.org;
> > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > > > This patch updates the Posted-Interrupts Descriptor when vCPU
> > > > is blocked.
> > > >
> > > > pre-block:
> > > > - Add the vCPU to the blocked per-CPU list
> > > > - Clear 'SN'
> > > > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > > >
> > > > post-block:
> > > > - Remove the vCPU from the per-CPU list
> > > >
> > > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |  2 +
> > > >  arch/x86/kvm/vmx.c              | 96
> > > +++++++++++++++++++++++++++++++++++++++++
> > > >  arch/x86/kvm/x86.c              | 22 +++++++---
> > > >  include/linux/kvm_host.h        |  4 ++
> > > >  virt/kvm/kvm_main.c             |  6 +++
> > > >  5 files changed, 123 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 13e3e40..32c110a 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
> > > >
> > > >  #define ASYNC_PF_PER_VCPU 64
> > > >
> > > > +extern void (*wakeup_handler_callback)(void);
> > > > +
> > > >  enum kvm_reg {
> > > >  	VCPU_REGS_RAX = 0,
> > > >  	VCPU_REGS_RCX = 1,
> > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > > index bf2e6cd..a1c83a2 100644
> > > > --- a/arch/x86/kvm/vmx.c
> > > > +++ b/arch/x86/kvm/vmx.c
> > > > @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> > > >  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> > > >  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> > > >
> > > > +/*
> > > > + * We maintain a per-CPU linked-list of vCPUs, so in wakeup_handler() we
> > > > + * can find which vCPU should be woken up.
> > > > + */
> > > > +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> > > > +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> > > > +
> > > >  static unsigned long *vmx_io_bitmap_a;
> > > >  static unsigned long *vmx_io_bitmap_b;
> > > >  static unsigned long *vmx_msr_bitmap_legacy;
> > > > @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > >  		struct pi_desc old, new;
> > > >  		unsigned int dest;
> > > > +		unsigned long flags;
> > > >
> > > >  		memset(&old, 0, sizeof(old));
> > > >  		memset(&new, 0, sizeof(new));
> > > > @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > >  			new.nv = POSTED_INTR_VECTOR;
> > > >  		} while (cmpxchg(&pi_desc->control, old.control,
> > > >  				new.control) != old.control);
> > > > +
> > > > +		/*
> > > > +		 * Delete the vCPU from the related wakeup queue
> > > > +		 * if we are resuming from blocked state
> > > > +		 */
> > > > +		if (vcpu->blocked) {
> > > > +			vcpu->blocked = false;
> > > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > +				vcpu->wakeup_cpu), flags);
> > > > +			list_del(&vcpu->blocked_vcpu_list);
> > > > +			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > +				vcpu->wakeup_cpu), flags);
> > > > +			vcpu->wakeup_cpu = -1;
> > > > +		}
> > > >  	}
> > > >  }
> > > >
> > > > @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > > >  	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> > > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > >  		struct pi_desc old, new;
> > > > +		unsigned long flags;
> > > > +		int cpu;
> > > > +		struct cpumask cpu_others_mask;
> > > >
> > > >  		memset(&old, 0, sizeof(old));
> > > >  		memset(&new, 0, sizeof(new));
> > > > @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > > >  				pi_set_sn(&new);
> > > >  			} while (cmpxchg(&pi_desc->control, old.control,
> > > >  					new.control) != old.control);
> > > > +		} else if (vcpu->blocked) {
> > > > +			/*
> > > > +			 * The vcpu is blocked on the wait queue.
> > > > +			 * Store the blocked vCPU on the list of the
> > > > +			 * vcpu->wakeup_cpu, which is the destination
> > > > +			 * of the wake-up notification event.
> > > > +			 */
> > > > +			vcpu->wakeup_cpu = vcpu->cpu;
> > > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > +					  vcpu->wakeup_cpu), flags);
> > > > +			list_add_tail(&vcpu->blocked_vcpu_list,
> > > > +				      &per_cpu(blocked_vcpu_on_cpu,
> > > > +				      vcpu->wakeup_cpu));
> > > > +			spin_unlock_irqrestore(
> > > > +					&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > +					vcpu->wakeup_cpu), flags);
> > > > +
> > > > +			do {
> > > > +				old.control = new.control = pi_desc->control;
> > > > +
> > > > +				/*
> > > > +				 * We should not block the vCPU if
> > > > +				 * an interrupt is posted for it.
> > > > +				 */
> > > > +				if (pi_test_on(pi_desc) == 1) {
> > > > +					/*
> > > > +					 * We need schedule the wakeup worker
> > > > +					 * on a different cpu other than
> > > > +					 * vcpu->cpu, because in some case,
> > > > +					 * schedule_work() will call
> > > > +					 * try_to_wake_up() which needs acquire
> > > > +					 * the rq lock. This can cause deadlock.
> > > > +					 */
> > > > +					cpumask_copy(&cpu_others_mask,
> > > > +						     cpu_online_mask);
> > > > +					cpu_clear(vcpu->cpu, cpu_others_mask);
> > > > +					cpu = any_online_cpu(cpu_others_mask);
> > > > +
> > > > +					schedule_work_on(cpu,
> > > > +							 &vcpu->wakeup_worker);
> > > > +				}
> > > > +
> > > > +				pi_clear_sn(&new);
> > > > +
> > > > +				/* set 'NV' to 'wakeup vector' */
> > > > +				new.nv = POSTED_INTR_WAKEUP_VECTOR;
> > > > +			} while (cmpxchg(&pi_desc->control, old.control,
> > > > +				new.control) != old.control);
> > > >  		}
> > >
> > > This can be done exclusively on HLT emulation, correct? (that is, on
> > > entry to HLT and exit from HLT).
> >
> > Do you mean the following?
> > In kvm_emulate_halt(), we do:
> > 1. Add vCPU in the blocking list
> > 2. Clear 'SN'
> > 3. set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >
> > In __vcpu_run(), after kvm_vcpu_block(), we remove the vCPU from the
> > blocking list.
> 
> Yes (please check it's OK to do this...).

I have thought about this for some time, and I feel this may be another way
to implement it. Do you mind sharing why you think this alternative
is better than the current one? Thanks a lot!

> 
> > > If the vcpu is scheduled out for any other reason (transition to
> > > userspace or transition to other thread), it will eventually resume
> > > execution. And in that case, continuation of execution does not depend
> > > on the event (VT-d interrupt) notification.
> >
> > Yes, I think this is true for my current implementation, right?
> >
> > >
> > > There is a race window with the code above, I believe.
> >
> > I did careful code review back and forth for the above code, It will
> > be highly appreciated if you can point out the race window!
> 
> So the remapping HW sees either POSTED_INTR_VECTOR or
> POSTED_INTR_WAKEUP_VECTOR.
> 
> You should:
> 
> 1. Set POSTED_INTR_WAKEUP_VECTOR.
> 2. Check for PIR / ON bit, which might have been set by
> POSTED_INTR_VECTOR notification.
> 3. emulate HLT.

My original idea for pre-block operation is:
1. Add vCPU to the per-cpu blocking list. Here is the code for this in my patch:
+                       vcpu->wakeup_cpu = vcpu->cpu;
+                       spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+                                         vcpu->wakeup_cpu), flags);
+                       list_add_tail(&vcpu->blocked_vcpu_list,
+                                     &per_cpu(blocked_vcpu_on_cpu,
+                                     vcpu->wakeup_cpu));
+                       spin_unlock_irqrestore(
+                                       &per_cpu(blocked_vcpu_on_cpu_lock,
+                                       vcpu->wakeup_cpu), flags);
2. Update Posted-interrupt descriptor, here is the code in my patch:
+                       do {
+                               old.control = new.control = pi_desc->control;
+
+                               /*
+                                * We should not block the vCPU if
+                                * an interrupt is posted for it.
+                                */
+                               if (pi_test_on(pi_desc) == 1) {
+                                       /*
+                                        * We need schedule the wakeup worker
+                                        * on a different cpu other than
+                                        * vcpu->cpu, because in some case,
+                                        * schedule_work() will call
+                                        * try_to_wake_up() which needs acquire
+                                        * the rq lock. This can cause deadlock.
+                                        */
+                                       cpumask_copy(&cpu_others_mask,
+                                                    cpu_online_mask);
+                                       cpu_clear(vcpu->cpu, cpu_others_mask);
+                                       cpu = any_online_cpu(cpu_others_mask);
+
+                                       schedule_work_on(cpu,
+                                                        &vcpu->wakeup_worker);
+                               }
+
+                               WARN((pi_desc->sn == 1),
+                                    "Warning: SN field of posted-interrupts "
+                                    "is set before blocking\n");
+
+                               /* set 'NV' to 'wakeup vector' */
+                               new.nv = POSTED_INTR_WAKEUP_VECTOR;
+                       } while (cmpxchg(&pi_desc->control, old.control,
+                               new.control) != old.control);

If the PIR/ON bit is set by a POSTED_INTR_VECTOR notification during the above operation, we will stop
blocking the vCPU as described above. But it seems I missed something in the above code that should have
been there from the beginning: I should add a 'break' at the end of the above 'if (pi_test_on(pi_desc) == 1) {}',
so that in this case the 'NV' field remains unchanged.
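
A sketch of the loop with that 'break' folded in (same names as in the patch;
the worker-CPU selection is elided, and whether the vCPU must also be taken
back off the per-CPU list in this path still needs checking):

	do {
		old.control = new.control = pi_desc->control;

		/*
		 * An interrupt was posted while we were setting up to
		 * block: kick off the wakeup worker and bail out
		 * without touching 'NV', so it stays POSTED_INTR_VECTOR.
		 */
		if (pi_test_on(pi_desc) == 1) {
			schedule_work_on(cpu, &vcpu->wakeup_worker);
			break;
		}

		/* set 'NV' to 'wakeup vector' */
		new.nv = POSTED_INTR_WAKEUP_VECTOR;
	} while (cmpxchg(&pi_desc->control, old.control,
			 new.control) != old.control);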

> 
> > > >  	}
> > > >
> > > > @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
> > > >  		return -EBUSY;
> > > >
> > > >  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > > +	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > > +	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > >
> > > >  	/*
> > > >  	 * Now we can enable the vmclear operation in kdump
> > > > @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
> > > >  	.pi_set_sn = vmx_pi_set_sn,
> > > >  };
> > > >
> > > > +/*
> > > > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> > > > + */
> > > > +void wakeup_handler(void)
> > > > +{
> > > > +	struct kvm_vcpu *vcpu;
> > > > +	int cpu = smp_processor_id();
> > > > +
> > > > +	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > +	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> > > > +			blocked_vcpu_list) {
> > > > +		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > +
> > > > +		if (pi_test_on(pi_desc) == 1)
> > > > +			kvm_vcpu_kick(vcpu);
> > > > +	}
> > > > +	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > +}
> > >
> > > Looping through all blocked vcpus does not scale:
> > > Can you allocate more vectors and then multiplex those
> > > vectors amongst the HLT'ed vcpus?
> >
> > I am a little confused about this, can you elaborate it a bit more?
> > Thanks a lot!
> 
> Picture the following overcommitment scenario:
> 
> * High ratio of vCPUs/pCPUs, in the ratio 128/1 (this is exaggerated
> to demonstrate the issue).
> * Every VT-d interrupt is going to scan 128 entries in the list.
> 
> Moreover, the test:
> 
> 		if (pi_test_on(pi_desc) == 1)
> 			kvm_vcpu_kick(vcpu);
> 
> Can trigger for vCPUs which have not been woken up due
> to VT-d interrupts, but due to other interrupts.
> 
> You can allocate, say 16 vectors on the pCPU for VT-d interrupts:
> 
> POSTED_INTERRUPT_WAKEUP_VECTOR_1,
> POSTED_INTERRUPT_WAKEUP_VECTOR_2,
> ...
> 

Global vectors are a limited resource in the system, and this involves
common x86 interrupt code changes. I am not sure we can allocate
so many dedicated global vectors for KVM usage.

> > > It seems there is a bunch free:
> > >
> > > commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> > > Author: Alex Shi <alex.shi@intel.com>
> > > Date:   Thu Jun 28 09:02:23 2012 +0800
> > >
> > >     x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
> > >
> > > Can you add only vcpus which have posted IRTEs that point to this pCPU
> > > to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> > > devices are not part of the list).
> >
> > Is it easy to find whether a vCPU (or the associated domain) has assigned
> devices?
> > If so, we can only add those vCPUs with assigned devices.
> 
> When configuring IRTE, at kvm_arch_vfio_update_pi_irte?

Yes.

> 
> > > > +
> > > >  static int __init vmx_init(void)
> > > >  {
> > > >  	int r, i, msr;
> > > > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> > > >
> > > >  	update_ple_window_actual_max();
> > > >
> > > > +	wakeup_handler_callback = wakeup_handler;
> > > > +
> > > >  	return 0;
> > > >
> > > >  out7:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 0033df3..1551a46 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > >  			kvm_vcpu_reload_apic_access_page(vcpu);
> > > >  	}
> > > >
> > > > +	/*
> > > > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > > > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > > > +	 * operations out of the if statement.
> > > > +	 */
> > > > +	if (kvm_lapic_enabled(vcpu)) {
> > > > +		/*
> > > > +		 * Update architecture specific hints for APIC
> > > > +		 * virtual interrupt delivery.
> > > > +		 */
> > > > +		if (kvm_x86_ops->hwapic_irr_update)
> > > > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > +				kvm_lapic_find_highest_irr(vcpu));
> > > > +	}
> > > > +
> > >
> > > This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.
> >
> > I am afraid setting KVM_REQ_EVENT from wakeup_handler doesn't help much:
> > if the vCPU is running in root mode and VT-d hardware issues a notification
> > event, the POSTED_INTR_VECTOR interrupt handler will be called.
> 
> If vCPU is in root mode, remapping HW will find IRTE configured with
> vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will
> VM-exit, and execute the interrupt handler wakeup_handler. Right?

There are two cases:
Case 1: the vCPU is blocked, so it is in root mode; this is what you described above.
Case 2: the vCPU is running in root mode, e.g. handling VM-exits. In this case, the
notification vector is 'POSTED_INTR_VECTOR', and if external interrupts from
assigned devices happen, the handler of 'POSTED_INTR_VECTOR' (it is
'smp_kvm_posted_intr_ipi' in fact) will be called. This routine doesn't need to do
any real work, since the pending interrupts in the PIR will be synced to the vIRR
before VM-entry (this code was already there when CPU-side posted interrupts
were enabled along with APICv). Like I said before, it is a little hard to get
vCPU-related information in it, and even if we do, it is not accurate and may harm
performance (it would need a search).

So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the notification
event for 'POSTED_INTR_VECTOR'.
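
For reference, the CPU-side flow described above is roughly the following
(a sketch modeled on the existing APICv sync code, not part of this patch):

static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* smp_kvm_posted_intr_ipi() itself does no real work; any
	 * bits posted into the PIR are folded into the vIRR here,
	 * on the way back into the guest. */
	if (!pi_test_and_clear_on(&vmx->pi_desc))
		return;

	kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
}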

> 
> The point of this comment is that you can keep the
> 
> "if (kvm_x86_ops->hwapic_irr_update)
> 	kvm_x86_ops->hwapic_irr_update(vcpu,
> 			kvm_lapic_find_highest_irr(vcpu));
> "
> 
> Code inside KVM_REQ_EVENT handling section of vcpu_run, as long as
> wakeup_handler sets KVM_REQ_EVENT.

Please see above.

> 
> > Again, the POSTED_INTR_VECTOR interrupt handler may be called very frequently;
> > it is a little hard to get vCPU-related information in it, and even if we do,
> > it is not accurate and may harm performance (it would need a search).
> >
> > >
> > > >  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> > > >  		kvm_apic_accept_events(vcpu);
> > > >  		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> > > > @@ -6168,13 +6183,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > >  			kvm_x86_ops->enable_irq_window(vcpu);
> > > >
> > > >  		if (kvm_lapic_enabled(vcpu)) {
> > > > -			/*
> > > > -			 * Update architecture specific hints for APIC
> > > > -			 * virtual interrupt delivery.
> > > > -			 */
> > > > -			if (kvm_x86_ops->hwapic_irr_update)
> > > > -				kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > -					kvm_lapic_find_highest_irr(vcpu));
> > > >  			update_cr8_intercept(vcpu);
> > > >  			kvm_lapic_sync_to_vapic(vcpu);
> > > >  		}
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3d7242c..d981d16 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -239,6 +239,9 @@ struct kvm_vcpu {
> > > >  	unsigned long requests;
> > > >  	unsigned long guest_debug;
> > > >
> > > > +	int wakeup_cpu;
> > > > +	struct list_head blocked_vcpu_list;
> > > > +
> > > >  	struct mutex mutex;
> > > >  	struct kvm_run *run;
> > > >
> > > > @@ -282,6 +285,7 @@ struct kvm_vcpu {
> > > >  	} spin_loop;
> > > >  #endif
> > > >  	bool preempted;
> > > > +	bool blocked;
> > > >  	struct kvm_vcpu_arch arch;
> > > >  };
> > >
> > > Please remove blocked and wakeup_cpu, they should not be necessary.
> >
> > Why do you think wakeup_cpu is not needed? When the vCPU is blocked,
> > wakeup_cpu saves the cpu the vCPU is blocked on; after the vCPU
> > is woken up, it can run on a different cpu, so we need wakeup_cpu to
> > find the right list to wake up the vCPU.
> 
> If the vCPU was moved it should have updated IRTE destination field
> to the pCPU which it has moved to?

Every time a vCPU is scheduled to a new pCPU, the IRTE destination field
is updated accordingly.

When the vCPU is blocked, to wake it up we need to find which per-CPU
list the vCPU is blocked on, and that is what wakeup_cpu is used for.


Thanks,
Feng
Marcelo Tosatti March 12, 2015, 1:15 a.m. UTC | #12
On Fri, Mar 06, 2015 at 06:51:52AM +0000, Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > Sent: Wednesday, March 04, 2015 8:06 PM
> > To: Wu, Feng
> > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> > 
> > On Mon, Mar 02, 2015 at 01:36:51PM +0000, Wu, Feng wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > > Sent: Friday, February 27, 2015 7:41 AM
> > > > To: Wu, Feng
> > > > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com;
> > x86@kernel.org;
> > > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> > vCPU
> > > > is blocked
> > > >
> > > > On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > > > > This patch updates the Posted-Interrupts Descriptor when vCPU
> > > > > is blocked.
> > > > >
> > > > > pre-block:
> > > > > - Add the vCPU to the blocked per-CPU list
> > > > > - Clear 'SN'
> > > > > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > > > >
> > > > > post-block:
> > > > > - Remove the vCPU from the per-CPU list
> > > > >
> > > > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > > > ---
> > > > >  arch/x86/include/asm/kvm_host.h |  2 +
> > > > >  arch/x86/kvm/vmx.c              | 96
> > > > +++++++++++++++++++++++++++++++++++++++++
> > > > >  arch/x86/kvm/x86.c              | 22 +++++++---
> > > > >  include/linux/kvm_host.h        |  4 ++
> > > > >  virt/kvm/kvm_main.c             |  6 +++
> > > > >  5 files changed, 123 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index 13e3e40..32c110a 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
> > > > >
> > > > >  #define ASYNC_PF_PER_VCPU 64
> > > > >
> > > > > +extern void (*wakeup_handler_callback)(void);
> > > > > +
> > > > >  enum kvm_reg {
> > > > >  	VCPU_REGS_RAX = 0,
> > > > >  	VCPU_REGS_RCX = 1,
> > > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > > > index bf2e6cd..a1c83a2 100644
> > > > > --- a/arch/x86/kvm/vmx.c
> > > > > +++ b/arch/x86/kvm/vmx.c
> > > > > @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> > > > >  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> > > > >  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> > > > >
> > > > > +/*
> > > > > + * We maintain a per-CPU linked-list of vCPUs, so in wakeup_handler() we
> > > > > + * can find which vCPU should be woken up.
> > > > > + */
> > > > > +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> > > > > +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> > > > > +
> > > > >  static unsigned long *vmx_io_bitmap_a;
> > > > >  static unsigned long *vmx_io_bitmap_b;
> > > > >  static unsigned long *vmx_msr_bitmap_legacy;
> > > > > @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > >  		struct pi_desc old, new;
> > > > >  		unsigned int dest;
> > > > > +		unsigned long flags;
> > > > >
> > > > >  		memset(&old, 0, sizeof(old));
> > > > >  		memset(&new, 0, sizeof(new));
> > > > > @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > > >  			new.nv = POSTED_INTR_VECTOR;
> > > > >  		} while (cmpxchg(&pi_desc->control, old.control,
> > > > >  				new.control) != old.control);
> > > > > +
> > > > > +		/*
> > > > > +		 * Delete the vCPU from the related wakeup queue
> > > > > +		 * if we are resuming from blocked state
> > > > > +		 */
> > > > > +		if (vcpu->blocked) {
> > > > > +			vcpu->blocked = false;
> > > > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > +				vcpu->wakeup_cpu), flags);
> > > > > +			list_del(&vcpu->blocked_vcpu_list);
> > > > > +			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > +				vcpu->wakeup_cpu), flags);
> > > > > +			vcpu->wakeup_cpu = -1;
> > > > > +		}
> > > > >  	}
> > > > >  }
> > > > >
> > > > > @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > > > >  	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> > > > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > >  		struct pi_desc old, new;
> > > > > +		unsigned long flags;
> > > > > +		int cpu;
> > > > > +		struct cpumask cpu_others_mask;
> > > > >
> > > > >  		memset(&old, 0, sizeof(old));
> > > > >  		memset(&new, 0, sizeof(new));
> > > > > @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > > > >  			} while (cmpxchg(&pi_desc->control, old.control,
> > > > >  					new.control) != old.control);
> > > > > +		} else if (vcpu->blocked) {
> > > > > +			/*
> > > > > +			 * The vcpu is blocked on the wait queue.
> > > > > +			 * Store the blocked vCPU on the list of the
> > > > > +			 * vcpu->wakeup_cpu, which is the destination
> > > > > +			 * of the wake-up notification event.
> > > > > +			 */
> > > > > +			vcpu->wakeup_cpu = vcpu->cpu;
> > > > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > +					  vcpu->wakeup_cpu), flags);
> > > > > +			list_add_tail(&vcpu->blocked_vcpu_list,
> > > > > +				      &per_cpu(blocked_vcpu_on_cpu,
> > > > > +				      vcpu->wakeup_cpu));
> > > > > +			spin_unlock_irqrestore(
> > > > > +					&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > +					vcpu->wakeup_cpu), flags);
> > > > > +
> > > > > +			do {
> > > > > +				old.control = new.control = pi_desc->control;
> > > > > +
> > > > > +				/*
> > > > > +				 * We should not block the vCPU if
> > > > > +				 * an interrupt is posted for it.
> > > > > +				 */
> > > > > +				if (pi_test_on(pi_desc) == 1) {
> > > > > +					/*
> > > > > +					 * We need schedule the wakeup worker
> > > > > +					 * on a different cpu other than
> > > > > +					 * vcpu->cpu, because in some case,
> > > > > +					 * schedule_work() will call
> > > > > +					 * try_to_wake_up() which needs acquire
> > > > > +					 * the rq lock. This can cause deadlock.
> > > > > +					 */
> > > > > +					cpumask_copy(&cpu_others_mask,
> > > > > +						     cpu_online_mask);
> > > > > +					cpu_clear(vcpu->cpu, cpu_others_mask);
> > > > > +					cpu = any_online_cpu(cpu_others_mask);
> > > > > +
> > > > > +					schedule_work_on(cpu,
> > > > > +							 &vcpu->wakeup_worker);
> > > > > +				}
> > > > > +
> > > > > +				pi_clear_sn(&new);
> > > > > +
> > > > > +				/* set 'NV' to 'wakeup vector' */
> > > > > +				new.nv = POSTED_INTR_WAKEUP_VECTOR;
> > > > > +			} while (cmpxchg(&pi_desc->control, old.control,
> > > > > +				new.control) != old.control);
> > > > >  		}
> > > >
> > > > This can be done exclusively on HLT emulation, correct? (that is, on
> > > > entry to HLT and exit from HLT).
> > >
> > > Do you mean the following?
> > > In kvm_emulate_halt(), we do:
> > > 1. Add vCPU in the blocking list
> > > 2. Clear 'SN'
> > > 3. set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > >
> > > In __vcpu_run(), after kvm_vcpu_block(), we remove the vCPU from the
> > > blocking list.
> > 
> > Yes (please check it's OK to do this...).
> 
> I have thought about this for some time, and I feel this may be another way
> to implement it. Do you mind sharing why you think this alternative
> is better than the current one? Thanks a lot!

Two reasons:

1) Because it does not add overhead to vcpu_puts that are not due to
HLT. Doing so removes the "vcpu->blocked" variable (it's implicit in the
code anyway).
2) Easier to spot races.

Do you have any reason why having the code at vcpu_put/vcpu_load is     
better than the proposal to have the code at kvm_vcpu_block?
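
For concreteness, a sketch of the kvm_vcpu_block-centered alternative
(pi_pre_block()/pi_post_block() are hypothetical helpers wrapping the
list/'SN'/'NV' updates from the patch, and the mp_state round-trip through
__vcpu_run is collapsed for clarity):

int kvm_emulate_halt_pi(struct kvm_vcpu *vcpu)
{
	pi_pre_block(vcpu);	/* enqueue on per-CPU list, clear SN,
				 * NV = POSTED_INTR_WAKEUP_VECTOR */
	kvm_vcpu_block(vcpu);	/* sleep until wakeup_handler kicks us */
	pi_post_block(vcpu);	/* dequeue, NV = POSTED_INTR_VECTOR */
	return 1;
}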

> > > > If the vcpu is scheduled out for any other reason (transition to
> > > > userspace or transition to other thread), it will eventually resume
> > > > execution. And in that case, continuation of execution does not depend
> > > > on the event (VT-d interrupt) notification.
> > >
> > > Yes, I think this is true for my current implementation, right?
> > >
> > > >
> > > > There is a race window with the code above, I believe.
> > >
> > > I did careful code review back and forth for the above code, It will
> > > be highly appreciated if you can point out the race window!
> > 
> > So the remapping HW sees either POSTED_INTR_VECTOR or
> > POSTED_INTR_WAKEUP_VECTOR.
> > 
> > You should:
> > 
> > 1. Set POSTED_INTR_WAKEUP_VECTOR.
> > 2. Check for PIR / ON bit, which might have been set by
> > POSTED_INTR_VECTOR notification.
> > 3. emulate HLT.
> 
> My original idea for pre-block operation is:
> 1. Add vCPU to the per-cpu blocking list. Here is the code for this in my patch:
> +                       vcpu->wakeup_cpu = vcpu->cpu;
> +                       spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> +                                         vcpu->wakeup_cpu), flags);
> +                       list_add_tail(&vcpu->blocked_vcpu_list,
> +                                     &per_cpu(blocked_vcpu_on_cpu,
> +                                     vcpu->wakeup_cpu));
> +                       spin_unlock_irqrestore(
> +                                       &per_cpu(blocked_vcpu_on_cpu_lock,
> +                                       vcpu->wakeup_cpu), flags);
> 2. Update Posted-interrupt descriptor, here is the code in my patch:
> +                       do {
> +                               old.control = new.control = pi_desc->control;
> +
> +                               /*
> +                                * We should not block the vCPU if
> +                                * an interrupt is posted for it.
> +                                */
> +                               if (pi_test_on(pi_desc) == 1) {
> +                                       /*
> +                                        * We need schedule the wakeup worker
> +                                        * on a different cpu other than
> +                                        * vcpu->cpu, because in some case,
> +                                        * schedule_work() will call
> +                                        * try_to_wake_up() which needs acquire
> +                                        * the rq lock. This can cause deadlock.
> +                                        */
> +                                       cpumask_copy(&cpu_others_mask,
> +                                                    cpu_online_mask);
> +                                       cpu_clear(vcpu->cpu, cpu_others_mask);
> +                                       cpu = any_online_cpu(cpu_others_mask);
> +
> +                                       schedule_work_on(cpu,
> +                                                        &vcpu->wakeup_worker);
> +                               }
> +
> +                               WARN((pi_desc->sn == 1),
> +                                    "Warning: SN field of posted-interrupts "
> +                                    "is set before blocking\n");
> +
> +                               /* set 'NV' to 'wakeup vector' */
> +                               new.nv = POSTED_INTR_WAKEUP_VECTOR;
> +                       } while (cmpxchg(&pi_desc->control, old.control,
> +                               new.control) != old.control);
> 
> If the PIR/ON bit is set by a POSTED_INTR_VECTOR notification during the above operation, we will stop
> blocking the vCPU as described above. But it seems I missed something in the above code that should have
> been there from the beginning: I should add a 'break' at the end of the above 'if (pi_test_on(pi_desc) == 1) {}',
> so that in this case the 'NV' field remains unchanged.

Right, we have to think carefully about all cases.

> > > > >  	}
> > > > >
> > > > > @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
> > > > >  		return -EBUSY;
> > > > >
> > > > >  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > > > +	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > > > +	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > >
> > > > >  	/*
> > > > >  	 * Now we can enable the vmclear operation in kdump
> > > > > @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
> > > > >  	.pi_set_sn = vmx_pi_set_sn,
> > > > >  };
> > > > >
> > > > > +/*
> > > > > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> > > > > + */
> > > > > +void wakeup_handler(void)
> > > > > +{
> > > > > +	struct kvm_vcpu *vcpu;
> > > > > +	int cpu = smp_processor_id();
> > > > > +
> > > > > +	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > > +	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> > > > > +			blocked_vcpu_list) {
> > > > > +		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > > +
> > > > > +		if (pi_test_on(pi_desc) == 1)
> > > > > +			kvm_vcpu_kick(vcpu);
> > > > > +	}
> > > > > +	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > > +}
> > > >
> > > > Looping through all blocked vcpus does not scale:
> > > > Can you allocate more vectors and then multiplex those
> > > > vectors amongst the HLT'ed vcpus?
> > >
> > > I am a little confused about this, can you elaborate it a bit more?
> > > Thanks a lot!
> > 
> > Picture the following overcommitment scenario:
> > 
> > * High ratio of vCPUs/pCPUs, in the ratio 128/1 (this is exaggerated
> > to demonstrate the issue).
> > * Every VT-d interrupt is going to scan 128 entries in the list.
> > 
> > Moreover, the test:
> > 
> > 		if (pi_test_on(pi_desc) == 1)
> > 			kvm_vcpu_kick(vcpu);
> > 
> > Can trigger for vCPUs which have not been woken up due
> > to VT-d interrupts, but due to other interrupts.
> > 
> > You can allocate, say 16 vectors on the pCPU for VT-d interrupts:
> > 
> > POSTED_INTERRUPT_WAKEUP_VECTOR_1,
> > POSTED_INTERRUPT_WAKEUP_VECTOR_2,
> > ...
> > 
> 
> > Global vectors are a limited resource in the system, and this involves
> > common x86 interrupt code changes. I am not sure we can allocate
> > so many dedicated global vectors for KVM usage.

Why not? Have KVM use all free vectors (so if vectors are necessary for
other purposes, people should shrink the KVM vector pool).

BTW the Intel docs talk about that ("one vector per vCPU").
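
Sketched, the multiplexing could look like this (the vector names and the
modulo policy are made up for illustration; each handler would then scan only
its own, much shorter, list):

#define PI_NR_WAKEUP_VECTORS	16

static DEFINE_PER_CPU(struct list_head,
		      blocked_vcpu_on_cpu[PI_NR_WAKEUP_VECTORS]);

/* Pick the wakeup vector (and thus the list) for a blocking vCPU. */
static unsigned int pi_wakeup_vector(struct kvm_vcpu *vcpu)
{
	return POSTED_INTR_WAKEUP_VECTOR_0 +
	       vcpu->vcpu_id % PI_NR_WAKEUP_VECTORS;
}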

> > > > It seems there is a bunch free:
> > > >
> > > > commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> > > > Author: Alex Shi <alex.shi@intel.com>
> > > > Date:   Thu Jun 28 09:02:23 2012 +0800
> > > >
> > > >     x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
> > > >
> > > > Can you add only vcpus which have posted IRTEs that point to this pCPU
> > > > to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> > > > devices are not part of the list).
> > >
> > > Is it easy to find whether a vCPU (or the associated domain) has assigned
> > devices?
> > > If so, we can only add those vCPUs with assigned devices.
> > 
> > When configuring IRTE, at kvm_arch_vfio_update_pi_irte?
> 
> Yes.
> 
> > 
> > > > > +
> > > > >  static int __init vmx_init(void)
> > > > >  {
> > > > >  	int r, i, msr;
> > > > > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> > > > >
> > > > >  	update_ple_window_actual_max();
> > > > >
> > > > > +	wakeup_handler_callback = wakeup_handler;
> > > > > +
> > > > >  	return 0;
> > > > >
> > > > >  out7:
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 0033df3..1551a46 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > > >  	}
> > > > >
> > > > > +	/*
> > > > > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > > > > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > > > > +	 * operations out of the if statement.
> > > > > +	 */
> > > > > +	if (kvm_lapic_enabled(vcpu)) {
> > > > > +		/*
> > > > > +		 * Update architecture specific hints for APIC
> > > > > +		 * virtual interrupt delivery.
> > > > > +		 */
> > > > > +		if (kvm_x86_ops->hwapic_irr_update)
> > > > > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > > +				kvm_lapic_find_highest_irr(vcpu));
> > > > > +	}
> > > > > +
> > > >
> > > > This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.
> > >
> > > I am afraid setting KVM_REQ_EVENT from wakeup_handler doesn't help much:
> > > if the vCPU is running in root mode and VT-d hardware issues a notification
> > > event, the POSTED_INTR_VECTOR interrupt handler will be called.
> > 
> > If vCPU is in root mode, remapping HW will find IRTE configured with
> > vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will
> > VM-exit, and execute the interrupt handler wakeup_handler. Right?
> 
> There are two cases:
> Case 1: the vCPU is blocked, so it is in root mode; this is what you described above.
> Case 2: the vCPU is running in root mode, e.g. handling VM-exits. In this case,
> the notification vector is 'POSTED_INTR_VECTOR', and if external interrupts
> from assigned devices happen, the handler of 'POSTED_INTR_VECTOR' (it is
> 'smp_kvm_posted_intr_ipi' in fact) will be called. This routine doesn't need to
> do any real work, since the pending interrupts in the PIR will be synced to the
> vIRR before VM-entry (this code was already there when CPU-side posted
> interrupts were enabled along with APICv). Like I said before, it is a little hard
> to get vCPU-related information in it, and even if we do, it is not accurate and
> may harm performance (it would need a search).
> 
> So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the notification
> event for 'POSTED_INTR_VECTOR'.
> 
> > 
> > The point of this comment is that you can keep the
> > 
> > "if (kvm_x86_ops->hwapic_irr_update)
> > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > 			kvm_lapic_find_highest_irr(vcpu));
> > "
> > 
> > Code inside KVM_REQ_EVENT handling section of vcpu_run, as long as
> > wakeup_handler sets KVM_REQ_EVENT.
> 
> Please see above.

OK, can you set KVM_REQ_EVENT in case the ON bit is set,
after disabling interrupts?

kvm_lapic_find_highest_irr(vcpu) eats some cache
(4 cache lines) versus 1 cache line for reading the ON bit.
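
In code, the suggestion is roughly (sketch only; pi_test_on() and
vcpu_to_pi_desc() as in the patch):

	local_irq_disable();
	/* One cache line: just the ON bit, not the whole PIR/IRR. */
	if (pi_test_on(vcpu_to_pi_desc(vcpu)))
		kvm_make_request(KVM_REQ_EVENT, vcpu);
	/* The existing vcpu->requests check then cancels the entry and
	 * reruns the KVM_REQ_EVENT path, hwapic_irr_update() included. */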

> > > > Please remove blocked and wakeup_cpu, they should not be necessary.
> > >
> > > Why do you think wakeup_cpu is not needed? When the vCPU is blocked,
> > > wakeup_cpu saves the cpu the vCPU is blocked on; after the vCPU
> > > is woken up, it can run on a different cpu, so we need wakeup_cpu to
> > > find the right list to wake up the vCPU.
> > 
> > If the vCPU was moved it should have updated IRTE destination field
> > to the pCPU which it has moved to?
> 
> Every time a vCPU is scheduled to a new pCPU, the IRTE destination field
> is updated accordingly.
> 
> When the vCPU is blocked, to wake it up we need to find which per-CPU
> list the vCPU is blocked on, and that is what wakeup_cpu is used for.

Right, perhaps prev_vcpu is a better name.

Wu, Feng March 16, 2015, 11:42 a.m. UTC | #13
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Thursday, March 12, 2015 9:15 AM
> To: Wu, Feng
> Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Fri, Mar 06, 2015 at 06:51:52AM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > Sent: Wednesday, March 04, 2015 8:06 PM
> > > To: Wu, Feng
> > > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com;
> x86@kernel.org;
> > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Mon, Mar 02, 2015 at 01:36:51PM +0000, Wu, Feng wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > > > Sent: Friday, February 27, 2015 7:41 AM
> > > > > To: Wu, Feng
> > > > > Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com;
> > > x86@kernel.org;
> > > > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > > > joro@8bytes.org; alex.williamson@redhat.com;
> jiang.liu@linux.intel.com;
> > > > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> > > vCPU
> > > > > is blocked
> > > > >
> > > > > On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > > > > > This patch updates the Posted-Interrupts Descriptor when vCPU
> > > > > > is blocked.
> > > > > >
> > > > > > pre-block:
> > > > > > - Add the vCPU to the blocked per-CPU list
> > > > > > - Clear 'SN'
> > > > > > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > > > > >
> > > > > > post-block:
> > > > > > - Remove the vCPU from the per-CPU list
> > > > > >
> > > > > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > > > > ---
> > > > > >  arch/x86/include/asm/kvm_host.h |  2 +
> > > > > >  arch/x86/kvm/vmx.c              | 96
> > > > > +++++++++++++++++++++++++++++++++++++++++
> > > > > >  arch/x86/kvm/x86.c              | 22 +++++++---
> > > > > >  include/linux/kvm_host.h        |  4 ++
> > > > > >  virt/kvm/kvm_main.c             |  6 +++
> > > > > >  5 files changed, 123 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > > index 13e3e40..32c110a 100644
> > > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > > @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
> > > > > >
> > > > > >  #define ASYNC_PF_PER_VCPU 64
> > > > > >
> > > > > > +extern void (*wakeup_handler_callback)(void);
> > > > > > +
> > > > > >  enum kvm_reg {
> > > > > >  	VCPU_REGS_RAX = 0,
> > > > > >  	VCPU_REGS_RCX = 1,
> > > > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > > > > index bf2e6cd..a1c83a2 100644
> > > > > > --- a/arch/x86/kvm/vmx.c
> > > > > > +++ b/arch/x86/kvm/vmx.c
> > > > > > @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> > > > > >  static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> > > > > >  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> > > > > >
> > > > > > +/*
> > > > > > + * We maintain a per-CPU linked-list of vCPUs, so in wakeup_handler() we
> > > > > > + * can find which vCPU should be woken up.
> > > > > > + */
> > > > > > +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> > > > > > +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> > > > > > +
> > > > > >  static unsigned long *vmx_io_bitmap_a;
> > > > > >  static unsigned long *vmx_io_bitmap_b;
> > > > > >  static unsigned long *vmx_msr_bitmap_legacy;
> > > > > > @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > > > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > > >  		struct pi_desc old, new;
> > > > > >  		unsigned int dest;
> > > > > > +		unsigned long flags;
> > > > > >
> > > > > >  		memset(&old, 0, sizeof(old));
> > > > > >  		memset(&new, 0, sizeof(new));
> > > > > > @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > > > > >  			new.nv = POSTED_INTR_VECTOR;
> > > > > >  		} while (cmpxchg(&pi_desc->control, old.control,
> > > > > >  				new.control) != old.control);
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * Delete the vCPU from the related wakeup queue
> > > > > > +		 * if we are resuming from blocked state
> > > > > > +		 */
> > > > > > +		if (vcpu->blocked) {
> > > > > > +			vcpu->blocked = false;
> > > > > > +
> 	spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > > +				vcpu->wakeup_cpu), flags);
> > > > > > +			list_del(&vcpu->blocked_vcpu_list);
> > > > > > +			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > > +		}
> > > > > >  	}
> > > > > >  }
> > > > > >
> > > > > > @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > > > > >  	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> > > > > >  		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > > >  		struct pi_desc old, new;
> > > > > > +		unsigned long flags;
> > > > > > +		int cpu;
> > > > > > +		struct cpumask cpu_others_mask;
> > > > > >
> > > > > >  		memset(&old, 0, sizeof(old));
> > > > > >  		memset(&new, 0, sizeof(new));
> > > > > > @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> > > > > >  				pi_set_sn(&new);
> > > > > >  			} while (cmpxchg(&pi_desc->control, old.control,
> > > > > >  					new.control) != old.control);
> > > > > > +		} else if (vcpu->blocked) {
> > > > > > +			/*
> > > > > > +			 * The vcpu is blocked on the wait queue.
> > > > > > +			 * Store the blocked vCPU on the list of the
> > > > > > +			 * vcpu->wakeup_cpu, which is the destination
> > > > > > +			 * of the wake-up notification event.
> > > > > > +			 */
> > > > > > +			vcpu->wakeup_cpu = vcpu->cpu;
> > > > > > +			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > > +					  vcpu->wakeup_cpu), flags);
> > > > > > +			list_add_tail(&vcpu->blocked_vcpu_list,
> > > > > > +				      &per_cpu(blocked_vcpu_on_cpu,
> > > > > > +				      vcpu->wakeup_cpu));
> > > > > > +			spin_unlock_irqrestore(
> > > > > > +					&per_cpu(blocked_vcpu_on_cpu_lock,
> > > > > > +					vcpu->wakeup_cpu), flags);
> > > > > > +
> > > > > > +			do {
> > > > > > +				old.control = new.control = pi_desc->control;
> > > > > > +
> > > > > > +				/*
> > > > > > +				 * We should not block the vCPU if
> > > > > > +				 * an interrupt is posted for it.
> > > > > > +				 */
> > > > > > +				if (pi_test_on(pi_desc) == 1) {
> > > > > > +					/*
> > > > > > +					 * We need schedule the wakeup worker
> > > > > > +					 * on a different cpu other than
> > > > > > +					 * vcpu->cpu, because in some case,
> > > > > > +					 * schedule_work() will call
> > > > > > +					 * try_to_wake_up() which needs acquire
> > > > > > +					 * the rq lock. This can cause deadlock.
> > > > > > +					 */
> > > > > > +					cpumask_copy(&cpu_others_mask,
> > > > > > +						     cpu_online_mask);
> > > > > > +					cpu_clear(vcpu->cpu, cpu_others_mask);
> > > > > > +					cpu = any_online_cpu(cpu_others_mask);
> > > > > > +
> > > > > > +					schedule_work_on(cpu,
> > > > > > +							 &vcpu->wakeup_worker);
> > > > > > +				}
> > > > > > +
> > > > > > +				pi_clear_sn(&new);
> > > > > > +
> > > > > > +				/* set 'NV' to 'wakeup vector' */
> > > > > > +				new.nv = POSTED_INTR_WAKEUP_VECTOR;
> > > > > > +			} while (cmpxchg(&pi_desc->control, old.control,
> > > > > > +				new.control) != old.control);
> > > > > >  		}
> > > > >
> > > > > This can be done exclusively on HLT emulation, correct? (that is, on
> > > > > entry to HLT and exit from HLT).
> > > >
> > > > Do you mean the following?
> > > > In kvm_emulate_halt(), we do:
> > > > 1. Add vCPU in the blocking list
> > > > 2. Clear 'SN'
> > > > 3. set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > > >
> > > > In __vcpu_run(), after kvm_vcpu_block(), we remove the vCPU from the
> > > > blocking list.
> > >
> > > Yes (please check it's OK to do this...).
> >
> > I have thought about this for some time, and I feel this may be another way
> > to implement it. Do you mind sharing why you think this alternative
> > is better than the current one? Thanks a lot!
> 
> Two reasons:
> 
> 1) Because it does not add overhead to vcpu_puts thats are not due to
> HLT. Doing so removes the "vcpu->blocked" variable (its implicit in the
> code anyway).
> 2) Easier to spot races.
> 
> Do you have any reason why having the code at vcpu_put/vcpu_load is
> better than the proposal to have the code at kvm_vcpu_block?

I think your proposal is good; I just want to better understand your idea here. :)

One thing: even if we put the code in kvm_vcpu_block, we still need to add code
at vcpu_put/vcpu_load for the preemption case, like what I do now.
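
For the preemption case, a sketch of what could remain in vcpu_put
(pi_set_sn() as in the patch, eliding its cmpxchg loop; everything else moves
to the block path):

static void vmx_pi_sched_out(struct kvm_vcpu *vcpu)
{
	/* Preempted, not halted: just suppress notifications (SN=1);
	 * no blocked-list or 'NV' manipulation is needed here. */
	if (vcpu->preempted)
		pi_set_sn(vcpu_to_pi_desc(vcpu));
}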

> 
> > > > > If the vcpu is scheduled out for any other reason (transition to
> > > > > userspace or transition to other thread), it will eventually resume
> > > > > execution. And in that case, continuation of execution does not depend
> > > > > on the event (VT-d interrupt) notification.
> > > >
> > > > Yes, I think this is true for my current implementation, right?
> > > >
> > > > >
> > > > > There is a race window with the code above, I believe.
> > > >
> > > > I did careful code review back and forth for the above code, It will
> > > > be highly appreciated if you can point out the race window!
> > >
> > > So the remapping HW sees either POSTED_INTR_VECTOR or
> > > POSTED_INTR_WAKEUP_VECTOR.
> > >
> > > You should:
> > >
> > > 1. Set POSTED_INTR_WAKEUP_VECTOR.
> > > 2. Check for PIR / ON bit, which might have been set by
> > > POSTED_INTR_VECTOR notification.
> > > 3. emulate HLT.
> >
> > My original idea for pre-block operation is:
> > 1. Add vCPU to the per-cpu blocking list. Here is the code for this in my patch:
> > +                       vcpu->wakeup_cpu = vcpu->cpu;
> > +                       spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > +                                         vcpu->wakeup_cpu), flags);
> > +                       list_add_tail(&vcpu->blocked_vcpu_list,
> > +                                     &per_cpu(blocked_vcpu_on_cpu,
> > +                                     vcpu->wakeup_cpu));
> > +                       spin_unlock_irqrestore(
> > +                                       &per_cpu(blocked_vcpu_on_cpu_lock,
> > +                                       vcpu->wakeup_cpu), flags);
> > 2. Update Posted-interrupt descriptor, here is the code in my patch:
> > +                       do {
> > +                               old.control = new.control = pi_desc->control;
> > +
> > +                               /*
> > +                                * We should not block the vCPU if
> > +                                * an interrupt is posted for it.
> > +                                */
> > +                               if (pi_test_on(pi_desc) == 1) {
> > +                                       /*
> > +                                        * We need schedule the wakeup worker
> > +                                        * on a different cpu other than
> > +                                        * vcpu->cpu, because in some case,
> > +                                        * schedule_work() will call
> > +                                        * try_to_wake_up() which needs acquire
> > +                                        * the rq lock. This can cause deadlock.
> > +                                        */
> > +                                       cpumask_copy(&cpu_others_mask,
> > +                                                    cpu_online_mask);
> > +                                       cpu_clear(vcpu->cpu, cpu_others_mask);
> > +                                       cpu = any_online_cpu(cpu_others_mask);
> > +
> > +                                       schedule_work_on(cpu,
> > +                                                        &vcpu->wakeup_worker);
> > +                               }
> > +
> > +                               WARN((pi_desc->sn == 1),
> > +                                    "Warning: SN field of posted-interrupts "
> > +                                    "is set before blocking\n");
> > +
> > +                               /* set 'NV' to 'wakeup vector' */
> > +                               new.nv = POSTED_INTR_WAKEUP_VECTOR;
> > +                       } while (cmpxchg(&pi_desc->control, old.control,
> > +                               new.control) != old.control);
> >
> > If the PIR/ON bit is set by a POSTED_INTR_VECTOR notification during the above
> > operation, we will stop blocking the vCPU as described above. But it seems I
> > missed something in the above code that should have been there from the
> > beginning: I should add a 'break' at the end of the above
> > 'if (pi_test_on(pi_desc) == 1) {}', so that in this case the 'NV' field
> > remains unchanged.
> 
> Right have to think carefully about all cases.
> 
> > > > > >  	}
> > > > > >
> > > > > > @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
> > > > > >  		return -EBUSY;
> > > > > >
> > > > > >  	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > > > > +	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > > > > +	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > > >
> > > > > >  	/*
> > > > > >  	 * Now we can enable the vmclear operation in kdump
> > > > > > @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
> > > > > >  	.pi_set_sn = vmx_pi_set_sn,
> > > > > >  };
> > > > > >
> > > > > > +/*
> > > > > > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> > > > > > + */
> > > > > > +void wakeup_handler(void)
> > > > > > +{
> > > > > > +	struct kvm_vcpu *vcpu;
> > > > > > +	int cpu = smp_processor_id();
> > > > > > +
> > > > > > +	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > > > +	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> > > > > > +			blocked_vcpu_list) {
> > > > > > +		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > > > > > +
> > > > > > +		if (pi_test_on(pi_desc) == 1)
> > > > > > +			kvm_vcpu_kick(vcpu);
> > > > > > +	}
> > > > > > +	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > > > +}
> > > > >
> > > > > Looping through all blocked vcpus does not scale:
> > > > > Can you allocate more vectors and then multiplex those
> > > > > vectors amongst the HLT'ed vcpus?
> > > >
> > > > I am a little confused about this, can you elaborate it a bit more?
> > > > Thanks a lot!
> > >
> > > Picture the following overcommitment scenario:
> > >
> > > * High ratio of vCPUs/pCPUs, in the ratio 128/1 (this is exaggerated
> > > to demonstrate the issue).
> > > * Every VT-d interrupt is going to scan 128 entries in the list.
> > >
> > > Moreover, the test:
> > >
> > > 		if (pi_test_on(pi_desc) == 1)
> > > 			kvm_vcpu_kick(vcpu);
> > >
> > > Can trigger for vCPUs which have not been woken up due
> > > to VT-d interrupts, but due to other interrupts.
> > >
> > > You can allocate, say 16 vectors on the pCPU for VT-d interrupts:
> > >
> > > POSTED_INTERRUPT_WAKEUP_VECTOR_1,
> > > POSTED_INTERRUPT_WAKEUP_VECTOR_2,
> > > ...
> > >
> >
> > Global vectors are a limited resource in the system, and this involves
> > common x86 interrupt code changes. I am not sure we can allocate
> > so many dedicated global vectors for KVM usage.
> 
> Why not? Have KVM use all free vectors (so if vectors are necessary for
> other purposes, people should shrink the KVM vector pool).

If we want to allocate more global vectors for this usage, we need hpa's
input on it. Peter, what is your opinion?

> 
> BTW the Intel docs talk about that ("one vector per vCPU").
Yes, the spec talks about this, but using one vector per vCPU is more complex.

> 
> > > > > It seems there is a bunch free:
> > > > >
> > > > > commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> > > > > Author: Alex Shi <alex.shi@intel.com>
> > > > > Date:   Thu Jun 28 09:02:23 2012 +0800
> > > > >
> > > > >     x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
> > > > >
> > > > > Can you add only vcpus which have posted IRTEs that point to this pCPU
> > > > > to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> > > > > devices are not part of the list).
> > > >
> > > > Is it easy to find whether a vCPU (or the associated domain) has assigned
> > > devices?
> > > > If so, we can only add those vCPUs with assigned devices.
> > >
> > > When configuring IRTE, at kvm_arch_vfio_update_pi_irte?
> >
> > Yes.
> >
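(For the record, something like the sketch below is what I have in mind; the
pi_irte_count field is hypothetical, it does not exist in the current code:)

        /*
         * Incremented in kvm_arch_vfio_update_pi_irte() whenever an IRTE is
         * configured to post to this vCPU, decremented on teardown.
         */
        static bool vcpu_has_posted_irte(struct kvm_vcpu *vcpu)
        {
                return atomic_read(&vcpu->pi_irte_count) > 0;
        }

vmx_vcpu_put() would then only add vCPUs for which this returns true to the
per-CPU blocked list.
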
> > >
> > > > > > +
> > > > > >  static int __init vmx_init(void)
> > > > > >  {
> > > > > >  	int r, i, msr;
> > > > > > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> > > > > >
> > > > > >  	update_ple_window_actual_max();
> > > > > >
> > > > > > +	wakeup_handler_callback = wakeup_handler;
> > > > > > +
> > > > > >  	return 0;
> > > > > >
> > > > > >  out7:
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index 0033df3..1551a46 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > > > > >  			kvm_vcpu_reload_apic_access_page(vcpu);
> > > > > >  	}
> > > > > >
> > > > > > +	/*
> > > > > > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > > > > > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > > > > > +	 * operations out of the if statement.
> > > > > > +	 */
> > > > > > +	if (kvm_lapic_enabled(vcpu)) {
> > > > > > +		/*
> > > > > > +		 * Update architecture specific hints for APIC
> > > > > > +		 * virtual interrupt delivery.
> > > > > > +		 */
> > > > > > +		if (kvm_x86_ops->hwapic_irr_update)
> > > > > > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > > > +				kvm_lapic_find_highest_irr(vcpu));
> > > > > > +	}
> > > > > > +
> > > > >
> > > > > This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.
> > > >
> > > > I am afraid setting KVM_REQ_EVENT from wakeup_handler doesn't help much:
> > > > if vCPU is running in ROOT mode, and VT-d hardware issues a notification
> > > > event, the POSTED_INTR_VECTOR interrupt handler will be called.
> > >
> > > If vCPU is in root mode, remapping HW will find IRTE configured with
> > > vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will
> > > VM-exit, and execute the interrupt handler wakeup_handler. Right?
> >
> > There are two cases:
> > Case 1: vCPU is blocked, so it is in root mode, this is what you described above.
> > Case 2: vCPU is running in root mode, such as handling vm-exits; in this case,
> > the notification vector is 'POSTED_INTR_VECTOR', and if external interrupts
> > from assigned devices happen, the handler of 'POSTED_INTR_VECTOR' will
> > be called (it is 'smp_kvm_posted_intr_ipi' in fact). This routine doesn't need
> > to do real things, since the pending interrupts in PIR will be synced to vIRR
> > before VM-Entry (this code has already been there since enabling CPU-side
> > posted-interrupt along with APICv). Like I said before, it is a little hard to
> > get vCPU related information in it, and even if we get it, it is not accurate
> > and may harm the performance (needs more investigation).
> >
> > So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the notification
> > event for 'POSTED_INTR_VECTOR'.
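
(For reference, that handler really is almost a no-op; from
arch/x86/kernel/irq.c it is roughly:)

        __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
        {
                struct pt_regs *old_regs = set_irq_regs(regs);

                ack_APIC_irq();
                irq_enter();
                inc_irq_stat(kvm_posted_intr_ipis);
                irq_exit();
                set_irq_regs(old_regs);
        }
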
> >
> > >
> > > The point of this comment is that you can keep the
> > >
> > > "if (kvm_x86_ops->hwapic_irr_update)
> > > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > > 			kvm_lapic_find_highest_irr(vcpu));
> > > "
> > >
> > > Code inside KVM_REQ_EVENT handling section of vcpu_run, as long as
> > > wakeup_handler sets KVM_REQ_EVENT.
> >
> > Please see above.
> 
> OK can you set KVM_REQ_EVENT in case the ON bit is set,
> after disabling interrupts ?
> 
Currently, the following code is executed before local_irq_disable() is called,
so do you mean 1) moving local_irq_disable() to a point before it, or 2) after
interrupts are disabled, setting KVM_REQ_EVENT in case the ON bit is set?

"if (kvm_x86_ops->hwapic_irr_update)
	kvm_x86_ops->hwapic_irr_update(vcpu,
			kvm_lapic_find_highest_irr(vcpu));
"

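If you mean 2), I guess it would look roughly like the sketch below. Note that
pi_test_on() lives in vmx.c, so x86.c would have to reach the PI descriptor
through some new hook (the pi_test_on member below is made up):

        local_irq_disable();

        /* 'ON' set means VT-d HW has posted an interrupt to this vCPU */
        if (kvm_x86_ops->pi_test_on && kvm_x86_ops->pi_test_on(vcpu))
                kvm_make_request(KVM_REQ_EVENT, vcpu);
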
> kvm_lapic_find_highest_irr(vcpu) eats some cache
> (4 cachelines) versus 1 cacheline for reading ON bit.
> 
> > > > > Please remove blocked and wakeup_cpu, they should not be necessary.
> > > >
> > > > Why do you think wakeup_cpu is not needed, when vCPU is blocked,
> > > > wakeup_cpu saves the cpu which the vCPU is blocked on, after vCPU
> > > > is woken up, it can run on a different cpu, so we need wakeup_cpu to
> > > > find the right list to wake up the vCPU.
> > >
> > > If the vCPU was moved it should have updated IRTE destination field
> > > to the pCPU which it has moved to?
> >
> > Every time a vCPU is scheduled to a new pCPU, the IRTE destination field
> > would be updated accordingly.
> >
> > When vCPU is blocked. To wake up the blocked vCPU, we need to find which
> > list the vCPU is blocked on, and this is what wakeup_cpu used for?
> 
> Right, perhaps prev_vcpu is a better name.

Do you mean "prev_pcpu"?

Thanks,
Feng

Marcelo Tosatti March 25, 2015, 11:17 p.m. UTC | #14
On Mon, Mar 16, 2015 at 11:42:06AM +0000, Wu, Feng wrote:
> > Do you have any reason why having the code at vcpu_put/vcpu_load is
> > better than the proposal to have the code at kvm_vcpu_block?
> 
> I think your proposal is good, I just want to better understand your idea here.:)

Reduce the overhead of vcpu sched in / vcpu sched out, basically.

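(Rough sketch of the idea; pre_block/post_block are illustrative names, not
existing hooks, and generic code would need arch wrappers for them:)

        void kvm_vcpu_block(struct kvm_vcpu *vcpu)
        {
                DEFINE_WAIT(wait);

                if (kvm_x86_ops->pre_block)     /* list add, NV = wakeup vector */
                        kvm_x86_ops->pre_block(vcpu);

                for (;;) {
                        prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
                        if (kvm_arch_vcpu_runnable(vcpu) || signal_pending(current))
                                break;
                        schedule();
                }
                finish_wait(&vcpu->wq, &wait);

                if (kvm_x86_ops->post_block)    /* list del, NV = POSTED_INTR_VECTOR */
                        kvm_x86_ops->post_block(vcpu);
        }
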
> One thing, even we put the code to kvm_vcpu_block, we still need to add code
> at vcpu_put/vcpu_load for the preemption case like what I did now.
> 
> > 
> > >
> > > Global vector is a limited resources in the system, and this involves
> > > common x86 interrupt code changes. I am not sure we can allocate
> > > so many dedicated global vector for KVM usage.
> > 
> > Why not? Have KVM use all free vectors (so if vectors are necessary for
> > other purposes, people should shrink the KVM vector pool).
> 
> If we want to allocate more global vector for this usage, we need hpa's
> input about it. Peter, what is your opinion?

Peter?

> > BTW the Intel docs talk about that ("one vector per vCPU").
> Yes, the Spec talks about this, but it is more complex using one vector per vCPU.
> 
> > 
> > > > > > It seems there is a bunch free:
> > > > > >
> > > > > > commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> > > > > > Author: Alex Shi <alex.shi@intel.com>
> > > > > > Date:   Thu Jun 28 09:02:23 2012 +0800
> > > > > >
> > > > > >     x86/tlb: replace INVALIDATE_TLB_VECTOR by
> > > > CALL_FUNCTION_VECTOR
> > > > > >
> > > > > > Can you add only vcpus which have posted IRTEs that point to this pCPU
> > > > > > to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> > > > > > devices are not part of the list).
> > > > >
> > > > > Is it easy to find whether a vCPU (or the associated domain) has assigned
> > > > devices?
> > > > > If so, we can only add those vCPUs with assigned devices.
> > > >
> > > > When configuring IRTE, at kvm_arch_vfio_update_pi_irte?
> > >
> > > Yes.
> > >
> > > >
> > > > > > > +
> > > > > > >  static int __init vmx_init(void)
> > > > > > >  {
> > > > > > >  	int r, i, msr;
> > > > > > > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> > > > > > >
> > > > > > >  	update_ple_window_actual_max();
> > > > > > >
> > > > > > > +	wakeup_handler_callback = wakeup_handler;
> > > > > > > +
> > > > > > >  	return 0;
> > > > > > >
> > > > > > >  out7:
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index 0033df3..1551a46 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct
> > kvm_vcpu
> > > > > > *vcpu)
> > > > > > >  			kvm_vcpu_reload_apic_access_page(vcpu);
> > > > > > >  	}
> > > > > > >
> > > > > > > +	/*
> > > > > > > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > > > > > > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > > > > > > +	 * operations out of the if statement.
> > > > > > > +	 */
> > > > > > > +	if (kvm_lapic_enabled(vcpu)) {
> > > > > > > +		/*
> > > > > > > +		 * Update architecture specific hints for APIC
> > > > > > > +		 * virtual interrupt delivery.
> > > > > > > +		 */
> > > > > > > +		if (kvm_x86_ops->hwapic_irr_update)
> > > > > > > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > > > > +				kvm_lapic_find_highest_irr(vcpu));
> > > > > > > +	}
> > > > > > > +
> > > > > >
> > > > > > This is a hot fast path. You can set KVM_REQ_EVENT from
> > wakeup_handler.
> > > > >
> > > > > I am afraid Setting KVM_REQ_EVENT from wakeup_handler doesn't help
> > > > much,
> > > > > if vCPU is running in ROOT mode, and VT-d hardware issues an notification
> > > > event,
> > > > > POSTED_INTR_VECTOR interrupt handler will be called.
> > > >
> > > > If vCPU is in root mode, remapping HW will find IRTE configured with
> > > > vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will
> > > > VM-exit, and execute the interrupt handler wakeup_handler. Right?
> > >
> > > There are two cases:
> > > Case 1: vCPU is blocked, so it is in root mode, this is what you described
> > above.
> > > Case 2, vCPU is running in root mode, such as, handling vm-exits, in this case,
> > > the notification vector is 'POSTED_INTR_VECTOR', and if external interrupts
> > > from assigned devices happen, the handled of 'POSTED_INTR_VECTOR' will
> > > be called ( it is 'smp_kvm_posted_intr_ipi' in fact), this routine doesn't need
> > > do real things, since the pending interrupts in PIR will be synced to vIRR
> > before
> > > VM-Entry (this code have already been there when enabling CPU-side
> > > posted-interrupt along with APICv). Like what I said before, it is a little hard
> > to
> > > get vCPU related information in it, even if we get, it is not accurate and may
> > harm
> > > the performance.(need search)
> > >
> > > So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the
> > notification
> > > event for 'POSTED_INTR_VECTOR'.
> > >
> > > >
> > > > The point of this comment is that you can keep the
> > > >
> > > > "if (kvm_x86_ops->hwapic_irr_update)
> > > > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > 			kvm_lapic_find_highest_irr(vcpu));
> > > > "
> > > >
> > > > Code inside KVM_REQ_EVENT handling section of vcpu_run, as long as
> > > > wakeup_handler sets KVM_REQ_EVENT.
> > >
> > > Please see above.
> > 
> > OK can you set KVM_REQ_EVENT in case the ON bit is set,
> > after disabling interrupts ?
> > 
> Currently, the following code is executed before local_irq_disable() is called,
> so do you mean 1)moving local_irq_disable() to the place before it. 2) after interrupt
> is disabled, set KVM_REQ_EVENT in case the ON bit is set?

2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit 
is set.

> 
> "if (kvm_x86_ops->hwapic_irr_update)
> 	kvm_x86_ops->hwapic_irr_update(vcpu,
> 			kvm_lapic_find_highest_irr(vcpu));
> 
> > kvm_lapic_find_highest_irr(vcpu) eats some cache
> > (4 cachelines) versus 1 cacheline for reading ON bit.
> > 
> > > > > > Please remove blocked and wakeup_cpu, they should not be necessary.
> > > > >
> > > > > Why do you think wakeup_cpu is not needed, when vCPU is blocked,
> > > > > wakeup_cpu saves the cpu which the vCPU is blocked on, after vCPU
> > > > > is woken up, it can run on a different cpu, so we need wakeup_cpu to
> > > > > find the right list to wake up the vCPU.
> > > >
> > > > If the vCPU was moved it should have updated IRTE destination field
> > > > to the pCPU which it has moved to?
> > >
> > > Every time a vCPU is scheduled to a new pCPU, the IRTE destination filed
> > > would be updated accordingly.
> > >
> > > When vCPU is blocked. To wake up the blocked vCPU, we need to find which
> > > list the vCPU is blocked on, and this is what wakeup_cpu used for?
> > 
> > Right, perhaps prev_vcpu is a better name.
> 
> Do you mean "prev_pcpu"?

Yes.


Wu, Feng March 27, 2015, 6:34 a.m. UTC | #15
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Thursday, March 26, 2015 7:18 AM
> To: Wu, Feng; hpa@zytor.com
> Cc: tglx@linutronix.de; mingo@redhat.com; hpa@zytor.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Mon, Mar 16, 2015 at 11:42:06AM +0000, Wu, Feng wrote:
> > > Do you have any reason why having the code at vcpu_put/vcpu_load is
> > > better than the proposal to have the code at kvm_vcpu_block?
> >
> > I think your proposal is good, I just want to better understand your idea
> here.:)
> 
> Reduce the overhead of vcpu sched in / vcpu sched out, basically.
> 
> > One thing, even we put the code to kvm_vcpu_block, we still need to add
> code
> > at vcpu_put/vcpu_load for the preemption case like what I did now.
> >
> > >
> > > >
> > > > Global vector is a limited resources in the system, and this involves
> > > > common x86 interrupt code changes. I am not sure we can allocate
> > > > so many dedicated global vector for KVM usage.
> > >
> > > Why not? Have KVM use all free vectors (so if vectors are necessary for
> > > other purposes, people should shrink the KVM vector pool).
> >
> > If we want to allocate more global vector for this usage, we need hpa's
> > input about it. Peter, what is your opinion?
> 
> Peter?
> 
> > > BTW the Intel docs talk about that ("one vector per vCPU").
> > Yes, the Spec talks about this, but it is more complex using one vector per
> vCPU.
> >
> > >
> > > > > > > It seems there is a bunch free:
> > > > > > >
> > > > > > > commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
> > > > > > > Author: Alex Shi <alex.shi@intel.com>
> > > > > > > Date:   Thu Jun 28 09:02:23 2012 +0800
> > > > > > >
> > > > > > >     x86/tlb: replace INVALIDATE_TLB_VECTOR by
> > > > > CALL_FUNCTION_VECTOR
> > > > > > >
> > > > > > > Can you add only vcpus which have posted IRTEs that point to this
> pCPU
> > > > > > > to the HLT'ed vcpu lists? (so for example, vcpus without assigned
> > > > > > > devices are not part of the list).
> > > > > >
> > > > > > Is it easy to find whether a vCPU (or the associated domain) has
> assigned
> > > > > devices?
> > > > > > If so, we can only add those vCPUs with assigned devices.
> > > > >
> > > > > When configuring IRTE, at kvm_arch_vfio_update_pi_irte?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > > > > +
> > > > > > > >  static int __init vmx_init(void)
> > > > > > > >  {
> > > > > > > >  	int r, i, msr;
> > > > > > > > @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
> > > > > > > >
> > > > > > > >  	update_ple_window_actual_max();
> > > > > > > >
> > > > > > > > +	wakeup_handler_callback = wakeup_handler;
> > > > > > > > +
> > > > > > > >  	return 0;
> > > > > > > >
> > > > > > > >  out7:
> > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > > index 0033df3..1551a46 100644
> > > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > > @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct
> > > kvm_vcpu
> > > > > > > *vcpu)
> > > > > > > >  			kvm_vcpu_reload_apic_access_page(vcpu);
> > > > > > > >  	}
> > > > > > > >
> > > > > > > > +	/*
> > > > > > > > +	 * Since posted-interrupts can be set by VT-d HW now, in this
> > > > > > > > +	 * case, KVM_REQ_EVENT is not set. We move the following
> > > > > > > > +	 * operations out of the if statement.
> > > > > > > > +	 */
> > > > > > > > +	if (kvm_lapic_enabled(vcpu)) {
> > > > > > > > +		/*
> > > > > > > > +		 * Update architecture specific hints for APIC
> > > > > > > > +		 * virtual interrupt delivery.
> > > > > > > > +		 */
> > > > > > > > +		if (kvm_x86_ops->hwapic_irr_update)
> > > > > > > > +			kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > > > > > +				kvm_lapic_find_highest_irr(vcpu));
> > > > > > > > +	}
> > > > > > > > +
> > > > > > >
> > > > > > > This is a hot fast path. You can set KVM_REQ_EVENT from
> > > wakeup_handler.
> > > > > >
> > > > > > I am afraid Setting KVM_REQ_EVENT from wakeup_handler doesn't
> help
> > > > > much,
> > > > > > if vCPU is running in ROOT mode, and VT-d hardware issues an
> notification
> > > > > event,
> > > > > > POSTED_INTR_VECTOR interrupt handler will be called.
> > > > >
> > > > > If vCPU is in root mode, remapping HW will find IRTE configured with
> > > > > vector == POSTED_INTR_WAKEUP_VECTOR, use that vector, which will
> > > > > VM-exit, and execute the interrupt handler wakeup_handler. Right?
> > > >
> > > > There are two cases:
> > > > Case 1: vCPU is blocked, so it is in root mode, this is what you described
> > > above.
> > > > Case 2, vCPU is running in root mode, such as, handling vm-exits, in this
> case,
> > > > the notification vector is 'POSTED_INTR_VECTOR', and if external
> interrupts
> > > > from assigned devices happen, the handled of 'POSTED_INTR_VECTOR'
> will
> > > > be called ( it is 'smp_kvm_posted_intr_ipi' in fact), this routine doesn't
> need
> > > > do real things, since the pending interrupts in PIR will be synced to vIRR
> > > before
> > > > VM-Entry (this code have already been there when enabling CPU-side
> > > > posted-interrupt along with APICv). Like what I said before, it is a little
> hard
> > > to
> > > > get vCPU related information in it, even if we get, it is not accurate and
> may
> > > harm
> > > > the performance.(need search)
> > > >
> > > > So only setting KVM_REQ_EVENT in wakeup_handler cannot cover the
> > > notification
> > > > event for 'POSTED_INTR_VECTOR'.
> > > >
> > > > >
> > > > > The point of this comment is that you can keep the
> > > > >
> > > > > "if (kvm_x86_ops->hwapic_irr_update)
> > > > > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > > 			kvm_lapic_find_highest_irr(vcpu));
> > > > > "
> > > > >
> > > > > Code inside KVM_REQ_EVENT handling section of vcpu_run, as long as
> > > > > wakeup_handler sets KVM_REQ_EVENT.
> > > >
> > > > Please see above.
> > >
> > > OK can you set KVM_REQ_EVENT in case the ON bit is set,
> > > after disabling interrupts ?
> > >
> > Currently, the following code is executed before local_irq_disable() is called,
> > so do you mean 1)moving local_irq_disable() to the place before it. 2) after
> interrupt
> > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> 
> 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> is set.

Here is my understanding of your comments:
- Disable interrupts
- Check 'ON'
- Set KVM_REQ_EVENT if 'ON' is set

Then we can put the above code inside " if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
just like it used to be. However, I still have some questions about this comment:

1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or other places?
If in vcpu_enter_guest(), since currently local_irq_disable() is called after 'KVM_REQ_EVENT'
is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is called?
2. 'ON' is set by VT-d hardware; it can be set even when interrupts are disabled (the related bit in PIR is also set).
So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly after interrupts are disabled?

I might have missed something in your comments; if so, please point it out. Thanks a lot!

Thanks,
Feng

> 
> >
> > "if (kvm_x86_ops->hwapic_irr_update)
> > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > 			kvm_lapic_find_highest_irr(vcpu));
> >
> > > kvm_lapic_find_highest_irr(vcpu) eats some cache
> > > (4 cachelines) versus 1 cacheline for reading ON bit.
> > >
> > > > > > > Please remove blocked and wakeup_cpu, they should not be
> necessary.
> > > > > >
> > > > > > Why do you think wakeup_cpu is not needed, when vCPU is blocked,
> > > > > > wakeup_cpu saves the cpu which the vCPU is blocked on, after vCPU
> > > > > > is woken up, it can run on a different cpu, so we need wakeup_cpu to
> > > > > > find the right list to wake up the vCPU.
> > > > >
> > > > > If the vCPU was moved it should have updated IRTE destination field
> > > > > to the pCPU which it has moved to?
> > > >
> > > > Every time a vCPU is scheduled to a new pCPU, the IRTE destination filed
> > > > would be updated accordingly.
> > > >
> > > > When vCPU is blocked. To wake up the blocked vCPU, we need to find
> which
> > > > list the vCPU is blocked on, and this is what wakeup_cpu used for?
> > >
> > > Right, perhaps prev_vcpu is a better name.
> >
> > Do you mean "prev_pcpu"?
> 
> Yes.
> 

Marcelo Tosatti March 27, 2015, 7:30 p.m. UTC | #16
On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > Currently, the following code is executed before local_irq_disable() is called,
> > > so do you mean 1)moving local_irq_disable() to the place before it. 2) after
> > interrupt
> > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > 
> > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > is set.
> 
> Here is my understanding about your comments here:
> - Disable interrupts
> - Check 'ON'
> - Set KVM_REQ_EVENT if 'ON' is set
> 
> Then we can put the above code inside " if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> just like it used to be. However, I still have some questions about this comment:
> 
> 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or other places?

See below:

> If in vcpu_enter_guest(), since currently local_irq_disable() is called after 'KVM_REQ_EVENT'
> is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is called?

        local_irq_disable();

	*** add code here ***

        if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
						^^^^^^^^^^^^^^
            || need_resched() || signal_pending(current)) {
                vcpu->mode = OUTSIDE_GUEST_MODE;
                smp_wmb();
                local_irq_enable();
                preempt_enable();
                vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
                r = 1;
                goto cancel_injection;
        }

> 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is disabled (the related bit in PIR is also set).

Yes, we are checking if the HW has set an interrupt in PIR while
outside VM (which requires PIR->VIRR transfer by software).

If the interrupt is set by hardware after local_irq_disable(),
VMX-entry will handle the interrupt, perform the PIR->VIRR
transfer, and reevaluate interrupts, injecting into the guest
if necessary. Is that correct?

> So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly after interrupt is disabled?

To replace the costly 

+            */
+           if (kvm_x86_ops->hwapic_irr_update)
+                   kvm_x86_ops->hwapic_irr_update(vcpu,
+                           kvm_lapic_find_highest_irr(vcpu));

Yes, I think so.

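(For comparison, the cheap test is a single bit in the first cache line of the
PI descriptor; in vmx.c it is roughly:)

        static int pi_test_on(struct pi_desc *pi_desc)
        {
                return test_bit(POSTED_INTR_ON,
                                (unsigned long *)&pi_desc->control);
        }
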
> I might miss something in your comments, if so please point out. Thanks a lot!
> 
> Thanks,
> Feng
> 
> > 
> > >
> > > "if (kvm_x86_ops->hwapic_irr_update)
> > > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > > 			kvm_lapic_find_highest_irr(vcpu));
> > >
> > > > kvm_lapic_find_highest_irr(vcpu) eats some cache
> > > > (4 cachelines) versus 1 cacheline for reading ON bit.
> > > >
> > > > > > > > Please remove blocked and wakeup_cpu, they should not be
> > necessary.
> > > > > > >
> > > > > > > Why do you think wakeup_cpu is not needed, when vCPU is blocked,
> > > > > > > wakeup_cpu saves the cpu which the vCPU is blocked on, after vCPU
> > > > > > > is woken up, it can run on a different cpu, so we need wakeup_cpu to
> > > > > > > find the right list to wake up the vCPU.
> > > > > >
> > > > > > If the vCPU was moved it should have updated IRTE destination field
> > > > > > to the pCPU which it has moved to?
> > > > >
> > > > > Every time a vCPU is scheduled to a new pCPU, the IRTE destination filed
> > > > > would be updated accordingly.
> > > > >
> > > > > When vCPU is blocked. To wake up the blocked vCPU, we need to find
> > which
> > > > > list the vCPU is blocked on, and this is what wakeup_cpu used for?
> > > >
> > > > Right, perhaps prev_vcpu is a better name.
> > >
> > > Do you mean "prev_pcpu"?
> > 
> > Yes.
> > 
> 
Wu, Feng March 30, 2015, 4:46 a.m. UTC | #17
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Saturday, March 28, 2015 3:30 AM
> To: Wu, Feng
> Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > Currently, the following code is executed before local_irq_disable() is
> called,
> > > > so do you mean 1)moving local_irq_disable() to the place before it. 2) after
> > > interrupt
> > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > >
> > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > is set.
> >
> > Here is my understanding about your comments here:
> > - Disable interrupts
> > - Check 'ON'
> > - Set KVM_REQ_EVENT if 'ON' is set
> >
> > Then we can put the above code inside " if
> (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > just like it used to be. However, I still have some questions about this
> comment:
> >
> > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or
> other places?
> 
> See below:
> 
> > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> 'KVM_REQ_EVENT'
> > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> called?
> 
>         local_irq_disable();
> 
> 	*** add code here ***

So we need to add code like the following here, right?

          if ('ON' is set)
              kvm_make_request(KVM_REQ_EVENT, vcpu);

> 
>         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> 						^^^^^^^^^^^^^^
>             || need_resched() || signal_pending(current)) {
>                 vcpu->mode = OUTSIDE_GUEST_MODE;
>                 smp_wmb();
>                 local_irq_enable();
>                 preempt_enable();
>                 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>                 r = 1;
>                 goto cancel_injection;
>         }
> 
> > 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is disabled
> (the related bit in PIR is also set).
> 
> Yes, we are checking if the HW has set an interrupt in PIR while
> outside VM (which requires PIR->VIRR transfer by software).
> 
> If the interrupt it set by hardware after local_irq_disable(),
> VMX-entry will handle the interrupt and perform the PIR->VIRR
> transfer and reevaluate interrupts, injecting to guest
> if necessary, is that correct ?
> 
> > So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly
> after interrupt is disabled?
> 
> To replace the costly
> 
> +            */
> +           if (kvm_x86_ops->hwapic_irr_update)
> +                   kvm_x86_ops->hwapic_irr_update(vcpu,
> +                           kvm_lapic_find_highest_irr(vcpu));
> 
> Yes, i think so.

After adding the "checking ON and setting KVM_REQ_EVENT" operations listed in my
comments above, do you mean we still need to keep the costly code above
inside "if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {}" in function
vcpu_enter_guest() as it used to be? If yes, my question is: what is the exact purpose
of the "checking ON and setting KVM_REQ_EVENT" operations? Here is the code flow in
vcpu_enter_guest():

1. Check KVM_REQ_EVENT; if it is set, sync pir->virr
2. Disable interrupts
3. Check ON and set KVM_REQ_EVENT -- here, we set KVM_REQ_EVENT, but it was
already checked in step 1, which means we get no benefit even if we set it here:
the "pir->virr" sync was done in step 1, and between step 3 and VM-Entry
we don't synchronize pir to virr. So even if we set KVM_REQ_EVENT here, the interrupts
remaining in PIR cannot be delivered to the guest during this VM-Entry, right?

Thanks,
Feng

> 
> > I might miss something in your comments, if so please point out. Thanks a
> lot!
> >
> > Thanks,
> > Feng
> >
> > >
> > > >
> > > > "if (kvm_x86_ops->hwapic_irr_update)
> > > > 	kvm_x86_ops->hwapic_irr_update(vcpu,
> > > > 			kvm_lapic_find_highest_irr(vcpu));
> > > >
> > > > > kvm_lapic_find_highest_irr(vcpu) eats some cache
> > > > > (4 cachelines) versus 1 cacheline for reading ON bit.
> > > > >
> > > > > > > > > Please remove blocked and wakeup_cpu, they should not be
> > > necessary.
> > > > > > > >
> > > > > > > > Why do you think wakeup_cpu is not needed, when vCPU is
> blocked,
> > > > > > > > wakeup_cpu saves the cpu which the vCPU is blocked on, after
> vCPU
> > > > > > > > is woken up, it can run on a different cpu, so we need wakeup_cpu
> to
> > > > > > > > find the right list to wake up the vCPU.
> > > > > > >
> > > > > > > If the vCPU was moved it should have updated IRTE destination field
> > > > > > > to the pCPU which it has moved to?
> > > > > >
> > > > > > Every time a vCPU is scheduled to a new pCPU, the IRTE destination
> filed
> > > > > > would be updated accordingly.
> > > > > >
> > > > > > When vCPU is blocked. To wake up the blocked vCPU, we need to find
> > > which
> > > > > > list the vCPU is blocked on, and this is what wakeup_cpu used for?
> > > > >
> > > > > Right, perhaps prev_vcpu is a better name.
> > > >
> > > > Do you mean "prev_pcpu"?
> > >
> > > Yes.
> > >
> >
Marcelo Tosatti March 30, 2015, 11:55 p.m. UTC | #18
On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > Sent: Saturday, March 28, 2015 3:30 AM
> > To: Wu, Feng
> > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com; x86@kernel.org;
> > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> > 
> > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > Currently, the following code is executed before local_irq_disable() is
> > called,
> > > > > so do you mean 1)moving local_irq_disable() to the place before it. 2) after
> > > > interrupt
> > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > >
> > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > > is set.
> > >
> > > Here is my understanding about your comments here:
> > > - Disable interrupts
> > > - Check 'ON'
> > > - Set KVM_REQ_EVENT if 'ON' is set
> > >
> > > Then we can put the above code inside " if
> > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > just like it used to be. However, I still have some questions about this
> > comment:
> > >
> > > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(), or
> > other places?
> > 
> > See below:
> > 
> > > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> > 'KVM_REQ_EVENT'
> > > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> > called?
> > 
> >         local_irq_disable();
> > 
> > 	*** add code here ***
> 
> So we need add code like the following here, right?
> 
>           if ('ON' is set)
>               kvm_make_request(KVM_REQ_EVENT, vcpu);

Yes.

> >         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> > 						^^^^^^^^^^^^^^

Point *1.

> >             || need_resched() || signal_pending(current)) {
> >                 vcpu->mode = OUTSIDE_GUEST_MODE;
> >                 smp_wmb();
> >                 local_irq_enable();
> >                 preempt_enable();
> >                 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> >                 r = 1;
> >                 goto cancel_injection;
> >         }
> > 
> > > 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is disabled
> > (the related bit in PIR is also set).
> > 
> > Yes, we are checking if the HW has set an interrupt in PIR while
> > outside VM (which requires PIR->VIRR transfer by software).
> > 
> > If the interrupt it set by hardware after local_irq_disable(),
> > VMX-entry will handle the interrupt and perform the PIR->VIRR
> > transfer and reevaluate interrupts, injecting to guest
> > if necessary, is that correct ?
> > 
> > > So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly
> > after interrupt is disabled?
> > 
> > To replace the costly
> > 
> > +            */
> > +           if (kvm_x86_ops->hwapic_irr_update)
> > +                   kvm_x86_ops->hwapic_irr_update(vcpu,
> > +                           kvm_lapic_find_highest_irr(vcpu));
> > 
> > Yes, i think so.
> 
> After adding the "checking ON and setting KVM_REQ_EVENT" operations listed in my
> comments above, do you mean we still need to keep the costly code above
> inside "if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {}" in function
> vcpu_enter_guest() as it used to be? If yes, my question is what is the exact purpose
> of "checking ON and setting KVM_REQ_EVENT" operations? Here is the code flow in
> vcpu_enter_guest():
> 
> 1. Check KVM_REQ_EVENT, if it is set, sync pir->virr
> 2. Disable interrupts
> 3. Check ON and set KVM_REQ_EVENT -- Here, we set KVM_REQ_EVENT, but it is
> checked in the step 1, which means, we cannot get any benefits even we set it here,
> since the "pir->virr" sync operation was done in step 1, between step 3 and VM-Entry,
> we don't synchronize the pir to virr. So even we set KVM_REQ_EVENT here, the interrupts
> remaining in PIR cannot be delivered to guest during this VM-Entry, right?

Please check point *1 above. The code will go back to  

"if (kvm_check_request(KVM_REQ_EVENT, vcpu)"

And perform the pir->virr sync.
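
(i.e., schematically:)

        local_irq_disable();

        if (pi_test_on(pi_desc))                        /* new check */
                kvm_make_request(KVM_REQ_EVENT, vcpu);  /* marks vcpu->requests */

        if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
            || need_resched() || signal_pending(current)) {
                vcpu->mode = OUTSIDE_GUEST_MODE;
                smp_wmb();
                local_irq_enable();
                preempt_enable();
                vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
                r = 1;
                goto cancel_injection;  /* vcpu_run() calls vcpu_enter_guest()
                                         * again, and its KVM_REQ_EVENT block
                                         * performs the pir->virr sync */
        }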

Wu, Feng March 31, 2015, 1:13 a.m. UTC | #19
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Tuesday, March 31, 2015 7:56 AM
> To: Wu, Feng
> Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU is
> blocked
> 
> On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > Sent: Saturday, March 28, 2015 3:30 AM
> > > To: Wu, Feng
> > > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com;
> x86@kernel.org;
> > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > Currently, the following code is executed before local_irq_disable() is
> > > called,
> > > > > > so do you mean 1)moving local_irq_disable() to the place before it. 2)
> after
> > > > > interrupt
> > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > >
> > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > > > is set.
> > > >
> > > > Here is my understanding about your comments here:
> > > > - Disable interrupts
> > > > - Check 'ON'
> > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > >
> > > > Then we can put the above code inside " if
> > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > just like it used to be. However, I still have some questions about this
> > > comment:
> > > >
> > > > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(),
> or
> > > other places?
> > >
> > > See below:
> > >
> > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> > > 'KVM_REQ_EVENT'
> > > > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> > > called?
> > >
> > >         local_irq_disable();
> > >
> > > 	*** add code here ***
> >
> > So we need add code like the following here, right?
> >
> >           if ('ON' is set)
> >               kvm_make_request(KVM_REQ_EVENT, vcpu);
> 
> Yes.
> 
> > >         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> > > 						^^^^^^^^^^^^^^
> 
> Point *1.
> 
> > >             || need_resched() || signal_pending(current)) {
> > >                 vcpu->mode = OUTSIDE_GUEST_MODE;
> > >                 smp_wmb();
> > >                 local_irq_enable();
> > >                 preempt_enable();
> > >                 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > >                 r = 1;
> > >                 goto cancel_injection;
> > >         }
> > >
> > > > 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is
> disabled
> > > (the related bit in PIR is also set).
> > >
> > > Yes, we are checking if the HW has set an interrupt in PIR while
> > > outside VM (which requires PIR->VIRR transfer by software).
> > >
> > > If the interrupt it set by hardware after local_irq_disable(),
> > > VMX-entry will handle the interrupt and perform the PIR->VIRR
> > > transfer and reevaluate interrupts, injecting to guest
> > > if necessary, is that correct ?
> > >
> > > > So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly
> > > after interrupt is disabled?
> > >
> > > To replace the costly
> > >
> > > +            */
> > > +           if (kvm_x86_ops->hwapic_irr_update)
> > > +                   kvm_x86_ops->hwapic_irr_update(vcpu,
> > > +                           kvm_lapic_find_highest_irr(vcpu));
> > >
> > > Yes, i think so.
> >
> > After adding the "checking ON and setting KVM_REQ_EVENT" operations
> listed in my
> > comments above, do you mean we still need to keep the costly code above
> > inside "if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {}" in
> function
> > vcpu_enter_guest() as it used to be? If yes, my question is what is the exact
> purpose
> > of "checking ON and setting KVM_REQ_EVENT" operations? Here is the code
> flow in
> > vcpu_enter_guest():
> >
> > 1. Check KVM_REQ_EVENT, if it is set, sync pir->virr
> > 2. Disable interrupts
> > 3. Check ON and set KVM_REQ_EVENT -- Here, we set KVM_REQ_EVENT, but
> it is
> > checked in the step 1, which means, we cannot get any benefits even we set it
> here,
> > since the "pir->virr" sync operation was done in step 1, between step 3 and
> VM-Entry,
> > we don't synchronize the pir to virr. So even we set KVM_REQ_EVENT here,
> the interrupts
> > remaining in PIR cannot be delivered to guest during this VM-Entry, right?
> 
> Please check point *1 above. The code will go back to
> 
> "if (kvm_check_request(KVM_REQ_EVENT, vcpu)"
> 
> And perform the pir->virr sync.

Ah, yes, that is the point I was missing. Thanks for pointing this out!

Thanks,
Feng
Wu, Feng April 14, 2015, 7:37 a.m. UTC | #20
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Tuesday, March 31, 2015 7:56 AM
> To: Wu, Feng
> Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > Sent: Saturday, March 28, 2015 3:30 AM
> > > To: Wu, Feng
> > > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com;
> x86@kernel.org;
> > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > Currently, the following code is executed before local_irq_disable() is
> > > called,
> > > > > > so do you mean 1)moving local_irq_disable() to the place before it. 2)
> after
> > > > > interrupt
> > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > >
> > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > > > is set.
> > > >
> > > > Here is my understanding about your comments here:
> > > > - Disable interrupts
> > > > - Check 'ON'
> > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > >
> > > > Then we can put the above code inside " if
> > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > just like it used to be. However, I still have some questions about this
> > > comment:
> > > >
> > > > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(),
> or
> > > other places?
> > >
> > > See below:
> > >
> > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> > > 'KVM_REQ_EVENT'
> > > > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> > > called?
> > >
> > >         local_irq_disable();
> > >
> > > 	*** add code here ***
> >
> > So we need add code like the following here, right?
> >
> >           if ('ON' is set)
> >               kvm_make_request(KVM_REQ_EVENT, vcpu);
> 

Hi Marcelo,

I changed the code as above, and then found that the ping latency was extremely large (70ms - 400ms).
I dug into it and found the root cause. We cannot use "checking ON" as the criterion, since 'ON'
can be cleared by hypervisor software in lots of places. In that case, KVM_REQ_EVENT is not
set when we check the 'ON' bit, hence the interrupts are not injected into the guest in time.

Please refer to the following code path, in which the 'ON' bit can be cleared:

apic_find_highest_irr() --> vmx_sync_pir_to_irr() --> pi_test_and_clear_on()

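(For reference, the clearing is an atomic test-and-clear of 'ON', so whichever
path runs it first consumes the notification; from vmx.c:)

        static int pi_test_and_clear_on(struct pi_desc *pi_desc)
        {
                return test_and_clear_bit(POSTED_INTR_ON,
                                (unsigned long *)&pi_desc->control);
        }
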
Searching through the code step by step, apic_find_highest_irr() can be called from many other places.

Thanks,
Feng

> Yes.
> 
> > >         if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> > > 						^^^^^^^^^^^^^^
> 
> Point *1.
> 
> > >             || need_resched() || signal_pending(current)) {
> > >                 vcpu->mode = OUTSIDE_GUEST_MODE;
> > >                 smp_wmb();
> > >                 local_irq_enable();
> > >                 preempt_enable();
> > >                 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > >                 r = 1;
> > >                 goto cancel_injection;
> > >         }
> > >
> > > > 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is
> disabled
> > > (the related bit in PIR is also set).
> > >
> > > Yes, we are checking if the HW has set an interrupt in PIR while
> > > outside VM (which requires PIR->VIRR transfer by software).
> > >
> > > If the interrupt it set by hardware after local_irq_disable(),
> > > VMX-entry will handle the interrupt and perform the PIR->VIRR
> > > transfer and reevaluate interrupts, injecting to guest
> > > if necessary, is that correct ?
> > >
> > > > So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly
> > > after interrupt is disabled?
> > >
> > > To replace the costly
> > >
> > > +            */
> > > +           if (kvm_x86_ops->hwapic_irr_update)
> > > +                   kvm_x86_ops->hwapic_irr_update(vcpu,
> > > +                           kvm_lapic_find_highest_irr(vcpu));
> > >
> > > Yes, i think so.
> >
> > After adding the "checking ON and setting KVM_REQ_EVENT" operations
> listed in my
> > comments above, do you mean we still need to keep the costly code above
> > inside "if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {}" in
> function
> > vcpu_enter_guest() as it used to be? If yes, my question is what is the exact
> purpose
> > of "checking ON and setting KVM_REQ_EVENT" operations? Here is the code
> flow in
> > vcpu_enter_guest():
> >
> > 1. Check KVM_REQ_EVENT, if it is set, sync pir->virr
> > 2. Disable interrupts
> > 3. Check ON and set KVM_REQ_EVENT -- Here, we set KVM_REQ_EVENT, but
> it is
> > checked in the step 1, which means, we cannot get any benefits even we set
> it here,
> > since the "pir->virr" sync operation was done in step 1, between step 3 and
> VM-Entry,
> > we don't synchronize the pir to virr. So even we set KVM_REQ_EVENT here,
> the interrupts
> > remaining in PIR cannot be delivered to guest during this VM-Entry, right?
> 
> Please check point *1 above. The code will go back to
> 
> "if (kvm_check_request(KVM_REQ_EVENT, vcpu)"
> 
> And perform the pir->virr sync.

Marcelo Tosatti June 5, 2015, 9:59 p.m. UTC | #21
On Tue, Apr 14, 2015 at 07:37:44AM +0000, Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > Sent: Tuesday, March 31, 2015 7:56 AM
> > To: Wu, Feng
> > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com; x86@kernel.org;
> > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> > 
> > On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > > Sent: Saturday, March 28, 2015 3:30 AM
> > > > To: Wu, Feng
> > > > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com;
> > x86@kernel.org;
> > > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> > vCPU
> > > > is blocked
> > > >
> > > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > > Currently, the following code is executed before local_irq_disable() is
> > > > called,
> > > > > > > so do you mean 1)moving local_irq_disable() to the place before it. 2)
> > after
> > > > > > interrupt
> > > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > > >
> > > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > > > > is set.
> > > > >
> > > > > Here is my understanding about your comments here:
> > > > > - Disable interrupts
> > > > > - Check 'ON'
> > > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > > >
> > > > > Then we can put the above code inside " if
> > > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > > just like it used to be. However, I still have some questions about this
> > > > comment:
> > > > >
> > > > > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(),
> > or
> > > > other places?
> > > >
> > > > See below:
> > > >
> > > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> > > > 'KVM_REQ_EVENT'
> > > > > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> > > > called?
> > > >
> > > >         local_irq_disable();
> > > >
> > > > 	*** add code here ***
> > >
> > > So we need add code like the following here, right?
> > >
> > >           if ('ON' is set)
> > >               kvm_make_request(KVM_REQ_EVENT, vcpu);
> > 
> 
> Hi Marcelo,
> 
> I changed the code as above, then I found that the ping latency was extremely big, (70ms - 400ms).
> I digged into it and got the root cause. We cannot use "checking-on" as the judgment, since 'ON'
> can be cleared by hypervisor software in lots of places. In this case, KVM_REQ_EVENT cannot be
> set when we check 'ON' bit, hence the interrupts are not injected to the guest in time.
> 
> Please refer to the following code, in which 'ON' bit can be cleared:
> 
> apic_find_highest_irr () --> vmx_sync_pir_to_irr () --> pi_test_and_clear_on()
> 
> Searching from the code step by step, apic_find_highest_irr() can be called by many other guys.
> 
> Thanks,

Ok then, ignore my suggestion.

Can you resend the latest version, please?


Wu, Feng June 8, 2015, 1:43 a.m. UTC | #22
> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Saturday, June 06, 2015 5:59 AM
> To: Wu, Feng
> Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com; x86@kernel.org;
> gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
> 
> On Tue, Apr 14, 2015 at 07:37:44AM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > Sent: Tuesday, March 31, 2015 7:56 AM
> > > To: Wu, Feng
> > > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com;
> x86@kernel.org;
> > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > joro@8bytes.org; alex.williamson@redhat.com; jiang.liu@linux.intel.com;
> > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > > > > Sent: Saturday, March 28, 2015 3:30 AM
> > > > > To: Wu, Feng
> > > > > Cc: hpa@zytor.com; tglx@linutronix.de; mingo@redhat.com;
> > > x86@kernel.org;
> > > > > gleb@kernel.org; pbonzini@redhat.com; dwmw2@infradead.org;
> > > > > joro@8bytes.org; alex.williamson@redhat.com;
> jiang.liu@linux.intel.com;
> > > > > eric.auger@linaro.org; linux-kernel@vger.kernel.org;
> > > > > iommu@lists.linux-foundation.org; kvm@vger.kernel.org
> > > > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> > > vCPU
> > > > > is blocked
> > > > >
> > > > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > > > Currently, the following code is executed before local_irq_disable()
> is
> > > > > called,
> > > > > > > > so do you mean 1)moving local_irq_disable() to the place before it.
> 2)
> > > after
> > > > > > > interrupt
> > > > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > > > >
> > > > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON
> bit
> > > > > > > is set.
> > > > > >
> > > > > > Here is my understanding about your comments here:
> > > > > > - Disable interrupts
> > > > > > - Check 'ON'
> > > > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > > > >
> > > > > > Then we can put the above code inside " if
> > > > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > > > just like it used to be. However, I still have some questions about this
> > > > > comment:
> > > > > >
> > > > > > 1. Where should I set KVM_REQ_EVENT? In function
> vcpu_enter_guest(),
> > > or
> > > > > other places?
> > > > >
> > > > > See below:
> > > > >
> > > > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called
> after
> > > > > 'KVM_REQ_EVENT'
> > > > > > is checked, is it helpful to set KVM_REQ_EVENT after
> local_irq_disable() is
> > > > > called?
> > > > >
> > > > >         local_irq_disable();
> > > > >
> > > > > 	*** add code here ***
> > > >
> > > > So we need add code like the following here, right?
> > > >
> > > >           if ('ON' is set)
> > > >               kvm_make_request(KVM_REQ_EVENT, vcpu);
> > >
> >
> > Hi Marcelo,
> >
> > I changed the code as above, then I found that the ping latency was
> extremely big, (70ms - 400ms).
> > I digged into it and got the root cause. We cannot use "checking-on" as the
> judgment, since 'ON'
> > can be cleared by hypervisor software in lots of places. In this case,
> KVM_REQ_EVENT cannot be
> > set when we check 'ON' bit, hence the interrupts are not injected to the guest
> in time.
> >
> > Please refer to the following code, in which 'ON' bit can be cleared:
> >
> > apic_find_highest_irr () --> vmx_sync_pir_to_irr () --> pi_test_and_clear_on()
> >
> > Searching from the code step by step, apic_find_highest_irr() can be called by
> many other guys.
> >
> > Thanks,
> 
> Ok then, ignore my suggestion.
> 
> Can you resend the latest version please ?

Thanks for your review, I will send the new version soon.

Thanks,
Feng

> 

diff mbox

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 13e3e40..32c110a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -101,6 +101,8 @@  static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 
 #define ASYNC_PF_PER_VCPU 64
 
+extern void (*wakeup_handler_callback)(void);
+
 enum kvm_reg {
 	VCPU_REGS_RAX = 0,
 	VCPU_REGS_RCX = 1,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bf2e6cd..a1c83a2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -832,6 +832,13 @@  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
+/*
+ * We maintain a per-CPU linked list of vCPUs, so in wakeup_handler() we
+ * can find which vCPU should be woken up.
+ */
+static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+
 static unsigned long *vmx_io_bitmap_a;
 static unsigned long *vmx_io_bitmap_b;
 static unsigned long *vmx_msr_bitmap_legacy;
@@ -1921,6 +1928,7 @@  static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 		struct pi_desc old, new;
 		unsigned int dest;
+		unsigned long flags;
 
 		memset(&old, 0, sizeof(old));
 		memset(&new, 0, sizeof(new));
@@ -1942,6 +1950,20 @@  static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 			new.nv = POSTED_INTR_VECTOR;
 		} while (cmpxchg(&pi_desc->control, old.control,
 				new.control) != old.control);
+
+		/*
+		 * Remove the vCPU from the per-CPU blocked list
+		 * if we are resuming from the blocked state.
+		 */
+		if (vcpu->blocked) {
+			vcpu->blocked = false;
+			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+				vcpu->wakeup_cpu), flags);
+			list_del(&vcpu->blocked_vcpu_list);
+			spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
+				vcpu->wakeup_cpu), flags);
+			vcpu->wakeup_cpu = -1;
+		}
 	}
 }
 
@@ -1950,6 +1972,9 @@  static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 	if (irq_remapping_cap(IRQ_POSTING_CAP)) {
 		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 		struct pi_desc old, new;
+		unsigned long flags;
+		int cpu;
+		struct cpumask cpu_others_mask;
 
 		memset(&old, 0, sizeof(old));
 		memset(&new, 0, sizeof(new));
@@ -1961,6 +1986,54 @@  static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 				pi_set_sn(&new);
 			} while (cmpxchg(&pi_desc->control, old.control,
 					new.control) != old.control);
+		} else if (vcpu->blocked) {
+			/*
+			 * The vCPU is blocked on the wait queue.
+			 * Add it to the blocked list of vcpu->wakeup_cpu,
+			 * which is the destination of the wake-up
+			 * notification event.
+			 */
+			vcpu->wakeup_cpu = vcpu->cpu;
+			spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+					  vcpu->wakeup_cpu), flags);
+			list_add_tail(&vcpu->blocked_vcpu_list,
+				      &per_cpu(blocked_vcpu_on_cpu,
+				      vcpu->wakeup_cpu));
+			spin_unlock_irqrestore(
+					&per_cpu(blocked_vcpu_on_cpu_lock,
+					vcpu->wakeup_cpu), flags);
+
+			do {
+				old.control = new.control = pi_desc->control;
+
+				/*
+				 * We should not block the vCPU if
+				 * an interrupt is posted for it.
+				 */
+				if (pi_test_on(pi_desc) == 1) {
+					/*
+				 * We need to schedule the wakeup worker
+				 * on a CPU other than vcpu->cpu, because
+				 * in some cases schedule_work() calls
+				 * try_to_wake_up(), which needs to
+				 * acquire the rq lock. This can cause
+				 * a deadlock.
+					 */
+					cpumask_copy(&cpu_others_mask,
+						     cpu_online_mask);
+					cpu_clear(vcpu->cpu, cpu_others_mask);
+					cpu = any_online_cpu(cpu_others_mask);
+
+					schedule_work_on(cpu,
+							 &vcpu->wakeup_worker);
+				}
+
+				pi_clear_sn(&new);
+
+				/* set 'NV' to 'wakeup vector' */
+				new.nv = POSTED_INTR_WAKEUP_VECTOR;
+			} while (cmpxchg(&pi_desc->control, old.control,
+				new.control) != old.control);
 		}
 	}
 
@@ -2842,6 +2915,8 @@  static int hardware_enable(void)
 		return -EBUSY;
 
 	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
+	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
 
 	/*
 	 * Now we can enable the vmclear operation in kdump
@@ -9315,6 +9390,25 @@  static struct kvm_x86_ops vmx_x86_ops = {
 	.pi_set_sn = vmx_pi_set_sn,
 };
 
+/*
+ * Handler for POSTED_INTR_WAKEUP_VECTOR.
+ */
+void wakeup_handler(void)
+{
+	struct kvm_vcpu *vcpu;
+	int cpu = smp_processor_id();
+
+	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
+			blocked_vcpu_list) {
+		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+		if (pi_test_on(pi_desc) == 1)
+			kvm_vcpu_kick(vcpu);
+	}
+	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+}
+
 static int __init vmx_init(void)
 {
 	int r, i, msr;
@@ -9429,6 +9523,8 @@  static int __init vmx_init(void)
 
 	update_ple_window_actual_max();
 
+	wakeup_handler_callback = wakeup_handler;
+
 	return 0;
 
 out7:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..1551a46 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6152,6 +6152,21 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_vcpu_reload_apic_access_page(vcpu);
 	}
 
+	/*
+	 * Since posted interrupts can now be set by VT-d HW, KVM_REQ_EVENT
+	 * may not be set when one arrives. Move the following operations
+	 * out of the if statement below.
+	 */
+	if (kvm_lapic_enabled(vcpu)) {
+		/*
+		 * Update architecture specific hints for APIC
+		 * virtual interrupt delivery.
+		 */
+		if (kvm_x86_ops->hwapic_irr_update)
+			kvm_x86_ops->hwapic_irr_update(vcpu,
+				kvm_lapic_find_highest_irr(vcpu));
+	}
+
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
 		kvm_apic_accept_events(vcpu);
 		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
@@ -6168,13 +6183,6 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_x86_ops->enable_irq_window(vcpu);
 
 		if (kvm_lapic_enabled(vcpu)) {
-			/*
-			 * Update architecture specific hints for APIC
-			 * virtual interrupt delivery.
-			 */
-			if (kvm_x86_ops->hwapic_irr_update)
-				kvm_x86_ops->hwapic_irr_update(vcpu,
-					kvm_lapic_find_highest_irr(vcpu));
 			update_cr8_intercept(vcpu);
 			kvm_lapic_sync_to_vapic(vcpu);
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d7242c..d981d16 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -239,6 +239,9 @@  struct kvm_vcpu {
 	unsigned long requests;
 	unsigned long guest_debug;
 
+	int wakeup_cpu;
+	struct list_head blocked_vcpu_list;
+
 	struct mutex mutex;
 	struct kvm_run *run;
 
@@ -282,6 +285,7 @@  struct kvm_vcpu {
 	} spin_loop;
 #endif
 	bool preempted;
+	bool blocked;
 	struct kvm_vcpu_arch arch;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ba53fd6..6deb994 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -233,6 +233,9 @@  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 
 	INIT_WORK(&vcpu->wakeup_worker, wakeup_thread);
 
+	vcpu->wakeup_cpu = -1;
+	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
+
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
@@ -243,6 +246,7 @@  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
 	vcpu->preempted = false;
+	vcpu->blocked = false;
 
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
@@ -1752,6 +1756,7 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	DEFINE_WAIT(wait);
 
 	for (;;) {
+		vcpu->blocked = true;
 		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
 
 		if (kvm_arch_vcpu_runnable(vcpu)) {
@@ -1767,6 +1772,7 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	}
 
 	finish_wait(&vcpu->wq, &wait);
+	vcpu->blocked = false;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_block);
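
The cmpxchg loops in vmx_vcpu_load() and vmx_vcpu_put() above all follow
the same lock-free read-modify-write pattern on the descriptor's control
word. Below is a minimal user-space model of the pre-block transition
(clear 'SN', point 'NV' at the wakeup vector); the field layout and
vector numbers are illustrative, and GCC atomics stand in for the
kernel's cmpxchg():

	#include <stdint.h>
	#include <stdio.h>

	/* Illustrative layout: ON = bit 0, SN = bit 1, NV = bits 16-23. */
	#define PI_ON       (1ULL << 0)
	#define PI_SN       (1ULL << 1)
	#define PI_NV_SHIFT 16
	#define PI_NV_MASK  (0xffULL << PI_NV_SHIFT)

	#define POSTED_INTR_VECTOR        0xf2	/* example vectors */
	#define POSTED_INTR_WAKEUP_VECTOR 0xf1

	struct pi_desc { uint64_t control; };

	/* Atomically clear SN and retarget NV, as the pre-block path does. */
	static void pi_pre_block(struct pi_desc *pi)
	{
		uint64_t old, new;

		do {
			old = pi->control;
			new = (old & ~(PI_SN | PI_NV_MASK)) |
			      ((uint64_t)POSTED_INTR_WAKEUP_VECTOR << PI_NV_SHIFT);
		} while (!__atomic_compare_exchange_n(&pi->control, &old, new, 0,
						      __ATOMIC_SEQ_CST,
						      __ATOMIC_SEQ_CST));
	}

	int main(void)
	{
		struct pi_desc pi = {
			.control = PI_SN |
				   ((uint64_t)POSTED_INTR_VECTOR << PI_NV_SHIFT),
		};

		pi_pre_block(&pi);
		printf("SN=%d NV=0x%02x\n",
		       (int)(pi.control & PI_SN),
		       (unsigned)((pi.control & PI_NV_MASK) >> PI_NV_SHIFT));
		return 0;
	}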
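
The other subtle point above is the hop to a different CPU before
schedule_work(): as the comment in vmx_vcpu_put() explains,
schedule_work() may call try_to_wake_up(), which needs a rq lock that
can already be held on this path, so the work is queued on some other
online CPU. A kernel-style sketch of that step, using the current
cpumask helper names (the hunk itself uses the older cpu_clear() and
any_online_cpu()):

	#include <linux/cpumask.h>
	#include <linux/workqueue.h>

	/*
	 * Queue @work on any online CPU other than @self, so that waking
	 * the worker cannot recurse into a rq lock held by the caller.
	 */
	static void schedule_work_elsewhere(int self, struct work_struct *work)
	{
		struct cpumask others;

		cpumask_copy(&others, cpu_online_mask);
		cpumask_clear_cpu(self, &others);
		schedule_work_on(cpumask_any(&others), work);
	}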