
[v4,3/4] VT-d PI: restrict the vcpu number on a given pcpu

Message ID 1499410140-8003-4-git-send-email-chao.gao@intel.com (mailing list archive)
State New, archived

Commit Message

Chao Gao July 7, 2017, 6:48 a.m. UTC
Currently, a blocked vCPU is put on its pCPU's PI blocking list. If
too many vCPUs are blocked on a given pCPU, the list can grow very long.
A simple analysis shows that with 32k domains and 128 vCPUs per domain,
about 4M vCPUs may be blocked in one pCPU's PI blocking list. When a
wakeup interrupt arrives, the list is traversed to find the specific
vCPUs to wake up, and in that case the traversal would take a long time.

To mitigate this issue, this patch limits the number of vCPUs tracked on a
given pCPU's blocking list, taking factors such as the performance of the
common case, the current HVM vCPU count and the current pCPU count into
consideration. With this approach the common case stays fast, and in the
extreme cases the list length is kept under control.

With this patch, when a vCPU is about to block, we check whether the length
of the PI blocking list of the pCPU the vCPU is running on exceeds the limit,
which is the average number of vCPUs per pCPU plus a constant. If not, the
vCPU is added to this pCPU's PI blocking list. Otherwise, another online pCPU
is chosen to accept the vCPU.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v4:
 - use a new lock to avoid adding a blocked vcpu to an offline pcpu's blocking
 list.

---
 xen/arch/x86/hvm/vmx/vmx.c | 136 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 114 insertions(+), 22 deletions(-)
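
As an illustration (not part of the patch), the limit policy described above
boils down to a check along the lines of the sketch below. It reuses the
num_hvm_vcpus counter, the per-cpu vmx_pi_blocking structure and the
PI_LIST_FIXED_NUM constant introduced by this series; the helper name and the
numbers in the comment are only an example.

    /* Sketch: is the PI blocking list of @cpu already over the limit? */
    static bool pi_list_over_limit_example(unsigned int cpu)
    {
        /* Limit = average number of HVM vCPUs per online pCPU, plus K. */
        unsigned int limit = atomic_read(&num_hvm_vcpus) / num_online_cpus() +
                             PI_LIST_FIXED_NUM;

        /* E.g. 1024 HVM vCPUs on 112 pCPUs: limit = 9 + 128 = 137. */
        return per_cpu(vmx_pi_blocking, cpu).counter > limit;
    }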

Comments

Jan Beulich July 7, 2017, 3:57 p.m. UTC | #1
>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, 
> vmx_pi_blocking);
>  uint8_t __read_mostly posted_intr_vector;
>  static uint8_t __read_mostly pi_wakeup_vector;
>  
> +/*
> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
> + * blocking list.
> + */
> +static DEFINE_SPINLOCK(remote_pbl_operation);

What is "pbl" supposed to stand for?

> +#define remote_pbl_operation_begin(flags)                   \
> +({                                                          \
> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
> +})
> +
> +#define remote_pbl_operation_done(flags)                    \
> +({                                                          \
> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
> +})

No need for the ({ }) here.

But then I don't understand what this is needed for in the first
place. If this is once again about CPU offlining, then I can only
repeat that such happens in stop_machine context. Otherwise
I'm afraid the comment ahead of this code section needs
adjustment, as I can't interpret it in another way.

> +/*
> + * By default, the local pcpu (means the one the vcpu is currently running on)
> + * is chosen as the destination of wakeup interrupt. But if the vcpu number of
> + * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable
> + * one.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. An experment on a
> + * skylake server which has 112 cpus and 64G memory shows the maximum time to
> + * wakeup a vcpu from a 128-entry blocking list takes about 22us, which is
> + * tolerable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is used
> + * which is beneficial to performance (at least, avoid an IPI when unblocking
> + * vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +static inline bool pi_over_limit(int cpu)

unsigned int

> +{
> +    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;
> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
> -    unsigned long flags;
> -    unsigned int dest;
> +    unsigned long flags[2];

??? You almost never need two flags instances in a function.

> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -		&per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +    bool in_remote_operation = false;
> +
> +    pi_cpu = v->processor;
> +
> +    if ( unlikely(pi_over_limit(pi_cpu)) )
> +    {
> +        remote_pbl_operation_begin(flags[0]);
> +        in_remote_operation = true;
> +        while (true)

Coding style (and "for ( ; ; )" is generally better anyway).

Jan
Chao Gao July 10, 2017, 1:17 a.m. UTC | #2
On Fri, Jul 07, 2017 at 09:57:47AM -0600, Jan Beulich wrote:
>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, 
>> vmx_pi_blocking);
>>  uint8_t __read_mostly posted_intr_vector;
>>  static uint8_t __read_mostly pi_wakeup_vector;
>>  
>> +/*
>> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
>> + * blocking list.
>> + */
>> +static DEFINE_SPINLOCK(remote_pbl_operation);
>
>What is "pbl" supposed to stand for?

pi blocking list.

>
>> +#define remote_pbl_operation_begin(flags)                   \
>> +({                                                          \
>> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
>> +})
>> +
>> +#define remote_pbl_operation_done(flags)                    \
>> +({                                                          \
>> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
>> +})
>
>No need for the ({ }) here.
>
>But then I don't understand what this is needed for in the first
>place. If this is once again about CPU offlining, then I can only
>repeat that such happens in stop_machine context. Otherwise

But I don't think vmx_pi_desc_fixup() happens in stop_machine context;
please refer to the cpu_callback() function in hvm.c and the point at which
notifier_call_chain(CPU_DEAD) is called in cpu_down().

Our goal here is to avoid adding an entry to a destroyed list.
To prevent destruction from happening during adding, we can put these two
operations in critical sections, like:

add:
	remote_pbl_operation_begin()
	add one entry to the list
	remote_pbl_operation_end()

destroy:
	remote_pbl_operation_begin()
	destruction
	remote_pbl_operation_end()

Destruction may happen before we enter the critical section,
so adding should be:

add:
	remote_pbl_operation_begin()
	check the list is still valid
	add one entry to the list
	remote_pbl_operation_end()

In this patch, we choose an online cpu's list. The list should be valid,
because a list is only destroyed after its cpu has gone offline.

>I'm afraid the comment ahead of this code section needs
>adjustment, as I can't interpret it in another way.

Thanks
Chao
Jan Beulich July 10, 2017, 9:36 a.m. UTC | #3
>>> On 10.07.17 at 03:17, <chao.gao@intel.com> wrote:
> On Fri, Jul 07, 2017 at 09:57:47AM -0600, Jan Beulich wrote:
>>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>>> +#define remote_pbl_operation_begin(flags)                   \
>>> +({                                                          \
>>> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
>>> +})
>>> +
>>> +#define remote_pbl_operation_done(flags)                    \
>>> +({                                                          \
>>> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
>>> +})
>>
>>No need for the ({ }) here.
>>
>>But then I don't understand what this is needed for in the first
>>place. If this is once again about CPU offlining, then I can only
>>repeat that such happens in stop_machine context. Otherwise
> 
> But I don't think vmx_pi_desc_fixup() happens in stop_machine context,
> please refer to cpu_callback() function in hvm.c and the time
> notifier_call_chain(CPU_DEAD) is called in cpu_down().

While that's true, the CPU at that point is no longer marked
online, so it shouldn't be a candidate anyway.

> Our goal here is to avoid adding one entry to a destroyed list.
> To avoid destruction happens during adding, we can put these two
> process in critical sections, like
> 
> add:
> 	remote_pbl_operation_begin()
> 	add one entry to the list
> 	remote_pbl_operation_end()
> 
> destroy:
> 	remote_pbl_operation_begin()
> 	destruction
> 	remote_pbl_operation_end()
> 
> Destruction may happen before we enter the critical section.

I don't think so, no: Xen is not preemptible, and stop-machine logic
involves scheduling a tasklet on each pCPU and waiting for it to
gain control. So as long as you don't "manually" force tasklets to
be run, I still don't see the need for this extra locking.

Jan
Chao Gao July 10, 2017, 11:42 a.m. UTC | #4
On Mon, Jul 10, 2017 at 03:36:52AM -0600, Jan Beulich wrote:
>>>> On 10.07.17 at 03:17, <chao.gao@intel.com> wrote:
>> On Fri, Jul 07, 2017 at 09:57:47AM -0600, Jan Beulich wrote:
>>>>>> On 07.07.17 at 08:48, <chao.gao@intel.com> wrote:
>>>> +#define remote_pbl_operation_begin(flags)                   \
>>>> +({                                                          \
>>>> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
>>>> +})
>>>> +
>>>> +#define remote_pbl_operation_done(flags)                    \
>>>> +({                                                          \
>>>> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
>>>> +})
>>>
>>>No need for the ({ }) here.
>>>
>>>But then I don't understand what this is needed for in the first
>>>place. If this is once again about CPU offlining, then I can only
>>>repeat that such happens in stop_machine context. Otherwise
>> 
>> But I don't think vmx_pi_desc_fixup() happens in stop_machine context,
>> please refer to cpu_callback() function in hvm.c and the time
>> notifier_call_chain(CPU_DEAD) is called in cpu_down().
>
>While that's true, the CPU at that point is no longer marked
>online, so it shouldn't be a candidate anyway.
>
>> Our goal here is to avoid adding one entry to a destroyed list.
>> To avoid destruction happens during adding, we can put these two
>> process in critical sections, like
>> 
>> add:
>> 	remote_pbl_operation_begin()
>> 	add one entry to the list
>> 	remote_pbl_operation_end()
>> 
>> destroy:
>> 	remote_pbl_operation_begin()
>> 	destruction
>> 	remote_pbl_operation_end()
>> 
>> Destruction may happen before we enter the critical section.
>
>I don't think so, no: Xen is not preemptible, and stop-machine logic
>involves scheduling a tasklet on each pCPU and waiting for it to
>gain control. So as long as you don't "manually" force tasklets to
>be run, I still don't see the need for this extra locking.

Impressive! I understand why you repeatedly mentioned stop_machine()
context to me now.

Thanks
Chao
George Dunlap July 21, 2017, 3:43 p.m. UTC | #5
On Fri, Jul 7, 2017 at 7:48 AM, Chao Gao <chao.gao@intel.com> wrote:
> Currently, a blocked vCPU is put in its pCPU's pi blocking list. If
> too many vCPUs are blocked on a given pCPU, it will incur that the list
> grows too long. After a simple analysis, there are 32k domains and
> 128 vcpu per domain, thus about 4M vCPUs may be blocked in one pCPU's
> PI blocking list. When a wakeup interrupt arrives, the list is
> traversed to find some specific vCPUs to wake them up. This traversal in
> that case would consume much time.
>
> To mitigate this issue, this patch limits the number of vCPUs tracked on a
> given pCPU's blocking list, taking factors such as perfomance of common case,
> current hvm vCPU count and current pCPU count into consideration. With this
> method, for the common case, it works fast and for some extreme cases, the
> list length is under control.
>
> With this patch, when a vcpu is to be blocked, we check whether the pi
> blocking list's length of the pcpu where the vcpu is running exceeds
> the limit which is the average vcpus per pcpu ratio plus a constant.
> If no, the vcpu is added to this pcpu's pi blocking list. Otherwise,
> another online pcpu is chosen to accept the vcpu.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> v4:
>  - use a new lock to avoid adding a blocked vcpu to a offline pcpu's blocking
>  list.
>
> ---
>  xen/arch/x86/hvm/vmx/vmx.c | 136 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 114 insertions(+), 22 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index ecd6485..04e9aa6 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -95,22 +95,91 @@ static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, vmx_pi_blocking);
>  uint8_t __read_mostly posted_intr_vector;
>  static uint8_t __read_mostly pi_wakeup_vector;
>
> +/*
> + * Protect critical sections to avoid adding a blocked vcpu to a destroyed
> + * blocking list.
> + */
> +static DEFINE_SPINLOCK(remote_pbl_operation);
> +
> +#define remote_pbl_operation_begin(flags)                   \
> +({                                                          \
> +    spin_lock_irqsave(&remote_pbl_operation, flags);        \
> +})
> +
> +#define remote_pbl_operation_done(flags)                    \
> +({                                                          \
> +    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
> +})
> +
>  void vmx_pi_per_cpu_init(unsigned int cpu)
>  {
>      INIT_LIST_HEAD(&per_cpu(vmx_pi_blocking, cpu).list);
>      spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
>  }
>
> +/*
> + * By default, the local pcpu (means the one the vcpu is currently running on)
> + * is chosen as the destination of wakeup interrupt. But if the vcpu number of
> + * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable
> + * one.
> + *
> + * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
> + * v_tot is the total number of hvm vcpus on the system, p_tot is the total
> + * number of pcpus in the system, and K is a fixed number. An experment on a
> + * skylake server which has 112 cpus and 64G memory shows the maximum time to
> + * wakeup a vcpu from a 128-entry blocking list takes about 22us, which is
> + * tolerable. So choose 128 as the fixed number K.
> + *
> + * This policy makes sure:
> + * 1) for common cases, the limit won't be reached and the local pcpu is used
> + * which is beneficial to performance (at least, avoid an IPI when unblocking
> + * vcpu).
> + * 2) for the worst case, the blocking list length scales with the vcpu count
> + * divided by the pcpu count.
> + */
> +#define PI_LIST_FIXED_NUM 128
> +#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
> +                           PI_LIST_FIXED_NUM)
> +static inline bool pi_over_limit(int cpu)
> +{
> +    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;

Is there any reason to hide this calculation behind a #define, when
it's only used once anyway?

Also -- the vast majority of the time, .counter will be <
PI_LIST_FIXED_NUM; there's no reason to do an atomic read and an
integer division in that case.  I would do this:

if ( likely(per_cpu(vmx_pi_blocking, cpu).counter <= PI_LIST_FIXED_NUM) )
    return 0;

return per_cpu(vmx_pi_blocking, cpu).counter >
       PI_LIST_FIXED_NUM + atomic_read(&num_hvm_vcpus) / num_online_cpus();

Also, I personally think it would make the code more readable to say,
"pi_under_limit()" instead; that way...

> +}
> +
>  static void vmx_vcpu_block(struct vcpu *v)
>  {
> -    unsigned long flags;
> -    unsigned int dest;
> +    unsigned long flags[2];
> +    unsigned int dest, pi_cpu;
>      spinlock_t *old_lock;
> -    spinlock_t *pi_blocking_list_lock =
> -               &per_cpu(vmx_pi_blocking, v->processor).lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    spinlock_t *pi_blocking_list_lock;
> +    bool in_remote_operation = false;
> +
> +    pi_cpu = v->processor;
> +
> +    if ( unlikely(pi_over_limit(pi_cpu)) )
> +    {

...here you put the most common thing first, and the exceptional case
second (which I think makes the code easier to understand).

In fact, you might consider putting this whole thing in a function;
something like:

unsigned int pi_get_blocking_cpu(unsigned int pi_cpu, unsigned long *flags)
{
    if ( pi_under_limit(pi_cpu) )
    {
        spin_lock_irqsave([pi lock], *flags);
        return pi_cpu;
    }

    /* Loop looking for a suitable pCPU in other places */
}

Probably also worth mentioning briefly in a comment why this loop is
guaranteed to terminate.

 -George
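
For illustration, a completed sketch of such a helper could look like the
following. It assumes the pi_under_limit() naming suggested above and reuses
the per-cpu vmx_pi_blocking structure, PI_LIST_FIXED_NUM and the cpumask
handling from the patch; it is not part of the submitted series.

    static unsigned int pi_get_blocking_cpu(unsigned int pi_cpu,
                                            unsigned long *flags)
    {
        if ( pi_under_limit(pi_cpu) )
        {
            spin_lock_irqsave(&per_cpu(vmx_pi_blocking, pi_cpu).lock, *flags);
            return pi_cpu;
        }

        /*
         * Look for another online pCPU.  The loop terminates because the
         * limit is num_hvm_vcpus / num_online_cpus() + PI_LIST_FIXED_NUM:
         * if every online pCPU were over the limit, the total number of
         * blocked vCPUs would exceed num_hvm_vcpus, which cannot happen.
         * Hence at least one online pCPU is always under the limit.
         */
        for ( ; ; )
        {
            pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
            spin_lock_irqsave(&per_cpu(vmx_pi_blocking, pi_cpu).lock, *flags);
            if ( pi_under_limit(pi_cpu) )
                return pi_cpu;
            spin_unlock_irqrestore(&per_cpu(vmx_pi_blocking, pi_cpu).lock,
                                   *flags);
        }
    }

The loop mirrors the while-loop already present in vmx_vcpu_block() in the
patch below, just with the locking folded into the helper.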

Patch

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index ecd6485..04e9aa6 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -95,22 +95,91 @@  static DEFINE_PER_CPU(struct vmx_pi_blocking_vcpu, vmx_pi_blocking);
 uint8_t __read_mostly posted_intr_vector;
 static uint8_t __read_mostly pi_wakeup_vector;
 
+/*
+ * Protect critical sections to avoid adding a blocked vcpu to a destroyed
+ * blocking list.
+ */
+static DEFINE_SPINLOCK(remote_pbl_operation);
+
+#define remote_pbl_operation_begin(flags)                   \
+({                                                          \
+    spin_lock_irqsave(&remote_pbl_operation, flags);        \
+})
+
+#define remote_pbl_operation_done(flags)                    \
+({                                                          \
+    spin_unlock_irqrestore(&remote_pbl_operation, flags);   \
+})
+
 void vmx_pi_per_cpu_init(unsigned int cpu)
 {
     INIT_LIST_HEAD(&per_cpu(vmx_pi_blocking, cpu).list);
     spin_lock_init(&per_cpu(vmx_pi_blocking, cpu).lock);
 }
 
+/*
+ * By default, the local pcpu (means the one the vcpu is currently running on)
+ * is chosen as the destination of wakeup interrupt. But if the vcpu number of
+ * the pcpu exceeds a limit, another pcpu is chosen until we find a suitable
+ * one.
+ *
+ * Currently, choose (v_tot/p_tot) + K as the limit of vcpu count, where
+ * v_tot is the total number of hvm vcpus on the system, p_tot is the total
+ * number of pcpus in the system, and K is a fixed number. An experment on a
+ * skylake server which has 112 cpus and 64G memory shows the maximum time to
+ * wakeup a vcpu from a 128-entry blocking list takes about 22us, which is
+ * tolerable. So choose 128 as the fixed number K.
+ *
+ * This policy makes sure:
+ * 1) for common cases, the limit won't be reached and the local pcpu is used
+ * which is beneficial to performance (at least, avoid an IPI when unblocking
+ * vcpu).
+ * 2) for the worst case, the blocking list length scales with the vcpu count
+ * divided by the pcpu count.
+ */
+#define PI_LIST_FIXED_NUM 128
+#define PI_LIST_LIMIT     (atomic_read(&num_hvm_vcpus) / num_online_cpus() + \
+                           PI_LIST_FIXED_NUM)
+static inline bool pi_over_limit(int cpu)
+{
+    return per_cpu(vmx_pi_blocking, cpu).counter > PI_LIST_LIMIT;
+}
+
 static void vmx_vcpu_block(struct vcpu *v)
 {
-    unsigned long flags;
-    unsigned int dest;
+    unsigned long flags[2];
+    unsigned int dest, pi_cpu;
     spinlock_t *old_lock;
-    spinlock_t *pi_blocking_list_lock =
-		&per_cpu(vmx_pi_blocking, v->processor).lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    spinlock_t *pi_blocking_list_lock;
+    bool in_remote_operation = false;
+
+    pi_cpu = v->processor;
+
+    if ( unlikely(pi_over_limit(pi_cpu)) )
+    {
+        remote_pbl_operation_begin(flags[0]);
+        in_remote_operation = true;
+        while (true)
+        {
+            /*
+             * Online pCPU's blocking list is always usable for it is
+             * destroyed after clearing the bit of cpu_online_map.
+             */
+            pi_cpu = cpumask_cycle(pi_cpu, &cpu_online_map);
+            pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
+            spin_lock_irqsave(pi_blocking_list_lock, flags[1]);
+            if ( !pi_over_limit(pi_cpu) )
+                break;
+            spin_unlock_irqrestore(pi_blocking_list_lock, flags[1]);
+        }
+    }
+    else
+    {
+        pi_blocking_list_lock = &per_cpu(vmx_pi_blocking, pi_cpu).lock;
+        spin_lock_irqsave(pi_blocking_list_lock, flags[1]);
+    }
 
-    spin_lock_irqsave(pi_blocking_list_lock, flags);
     old_lock = cmpxchg(&v->arch.hvm_vmx.pi_blocking.lock, NULL,
                        pi_blocking_list_lock);
 
@@ -123,8 +192,10 @@  static void vmx_vcpu_block(struct vcpu *v)
 
     per_cpu(vmx_pi_blocking, v->processor).counter++;
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
-                  &per_cpu(vmx_pi_blocking, v->processor).list);
-    spin_unlock_irqrestore(pi_blocking_list_lock, flags);
+                  &per_cpu(vmx_pi_blocking, pi_cpu).list);
+    spin_unlock_irqrestore(pi_blocking_list_lock, flags[1]);
+    if ( unlikely(in_remote_operation) )
+        remote_pbl_operation_done(flags[0]);
 
     ASSERT(!pi_test_sn(pi_desc));
 
@@ -132,6 +203,19 @@  static void vmx_vcpu_block(struct vcpu *v)
 
     ASSERT(pi_desc->ndst ==
            (x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK)));
+    if ( unlikely(pi_cpu != v->processor) )
+    {
+        /*
+         * The vcpu is put to another pCPU's blocking list. Change 'NDST'
+         * field to that pCPU to make sure it can wake up the vcpu when an
+         * interrupt arrives. The 'NDST' field will be set to the pCPU which
+         * the vcpu is running on during VM-Entry, seeing
+         * vmx_pi_unblock_vcpu().
+         */
+        dest = cpu_physical_id(pi_cpu);
+        write_atomic(&pi_desc->ndst,
+                  x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
+    }
 
     write_atomic(&pi_desc->nv, pi_wakeup_vector);
 }
@@ -162,13 +246,17 @@  static void vmx_pi_unblock_vcpu(struct vcpu *v)
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned int dest = cpu_physical_id(v->processor);
 
     /*
-     * Set 'NV' field back to posted_intr_vector, so the
-     * Posted-Interrupts can be delivered to the vCPU when
-     * it is running in non-root mode.
+     * Set 'NV' field back to posted_intr_vector and 'NDST' field to the pCPU
+     * where the vcpu is running (for this field may now point to another
+     * pCPU), so the Posted-Interrupts can be delivered to the vCPU when it
+     * is running in non-root mode.
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
+    write_atomic(&pi_desc->ndst,
+                 x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
 
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
@@ -215,13 +303,8 @@  void vmx_pi_desc_fixup(unsigned int cpu)
     if ( !iommu_intpost )
         return;
 
-    /*
-     * We are in the context of CPU_DEAD or CPU_UP_CANCELED notification,
-     * and it is impossible for a second CPU go down in parallel. So we
-     * can safely acquire the old cpu's lock and then acquire the new_cpu's
-     * lock after that.
-     */
-    spin_lock_irqsave(old_lock, flags);
+    remote_pbl_operation_begin(flags);
+    spin_lock(old_lock);
 
     list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
     {
@@ -245,16 +328,24 @@  void vmx_pi_desc_fixup(unsigned int cpu)
         }
         else
         {
+            new_cpu = cpu;
             /*
              * We need to find an online cpu as the NDST of the PI descriptor, it
              * doesn't matter whether it is within the cpupool of the domain or
              * not. As long as it is online, the vCPU will be woken up once the
-             * notification event arrives.
+             * notification event arrives. Through a while-loop to find a
+             * target pCPU whose PI Blocking List's length isn't over the limit.
              */
-            new_cpu = cpumask_any(&cpu_online_map);
-            new_lock = &per_cpu(vmx_pi_blocking, new_cpu).lock;
+            while (true)
+            {
+                new_cpu = cpumask_cycle(new_cpu, &cpu_online_map);
+                new_lock = &per_cpu(vmx_pi_blocking, new_cpu).lock;
 
-            spin_lock(new_lock);
+                spin_lock(new_lock);
+                if ( !pi_over_limit(new_cpu) )
+                    break;
+                spin_unlock(new_lock);
+            }
 
             ASSERT(vmx->pi_blocking.lock == old_lock);
 
@@ -274,7 +365,8 @@  void vmx_pi_desc_fixup(unsigned int cpu)
         pi_clear_sn(&vmx->pi_desc);
     }
 
-    spin_unlock_irqrestore(old_lock, flags);
+    spin_unlock(old_lock);
+    remote_pbl_operation_done(flags);
 }
 
 /*