diff mbox

[0/4] mitigate the per-pCPU blocking list may be too long

Message ID 20170502054459.GA13105@skl-2s3.sh.intel.com
State New, archived
Headers show

Commit Message

Chao Gao May 2, 2017, 5:45 a.m. UTC
On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>On 26/04/17 01:52, Chao Gao wrote:
>> I compared the maximum of #entry in one list and #event (adding entry to
>> PI blocking list) with and without the three latter patches. Here
>> is the result:
>> -------------------------------------------------------------
>> |               |                      |                    |
>> |    Items      |   Maximum of #entry  |      #event        |
>> |               |                      |                    |
>> -------------------------------------------------------------
>> |               |                      |                    |
>> |W/ the patches |         6            |       22740        |
>> |               |                      |                    |
>> -------------------------------------------------------------
>> |               |                      |                    |
>> |W/O the patches|        128           |       46481        |
>> |               |                      |                    |
>> -------------------------------------------------------------
>
>Any chance you could trace how long the list traversal took?  It would
>be good for future reference to have an idea what kinds of timescales
>we're talking about.

Hi.

I made a simple test to get the time consumed by the list traversal.
Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
All guest vcpu are pinned to one pcpu. collect data by
'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
xentrace_format. When the list length is about 128, the traversal time
is in the range of 1750 cycles to 39330 cycles. The physical cpu's
frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
to 22us. If 0.5ms is the upper bound the system can tolerate, at most
2900 vcpus can be added into the list.

I hope there is no error in the test and analysis.

Thanks
Chao

---8<---
From 504fd32bc042670812daad41efcd982434b98cd4 Mon Sep 17 00:00:00 2001
From: Chao Gao <chao.gao@intel.com>
Date: Wed, 26 Apr 2017 03:39:06 +0800
Subject: [PATCH] xentrace: trace PI-related events.

This patch adds TRC_HVM_VT_D_PI_BLOCK, TRC_HVM_PI_WAKEUP_START and
TRC_HVM_PI_WAKEUP_END to track PI-related events. Specifically,
TRC_HVM_VT_D_PI_BLOCK track adding one entry to the per-pcpu
blocking list. TRC_HVM_PI_WAKEUP_{START, END} mark the start and end
of PI blocking list traversal. Also introduce a 'counter' to track the
number of entries in the list.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 tools/xentrace/formats          |  3 +++
 xen/arch/x86/hvm/vmx/vmx.c      | 13 ++++++++++++-
 xen/include/asm-x86/hvm/trace.h |  3 +++
 xen/include/public/trace.h      |  3 +++
 4 files changed, 21 insertions(+), 1 deletion(-)

Comments

George Dunlap May 3, 2017, 10:08 a.m. UTC | #1
On 02/05/17 06:45, Chao Gao wrote:
> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>> On 26/04/17 01:52, Chao Gao wrote:
>>> I compared the maximum of #entry in one list and #event (adding entry to
>>> PI blocking list) with and without the three latter patches. Here
>>> is the result:
>>> -------------------------------------------------------------
>>> |               |                      |                    |
>>> |    Items      |   Maximum of #entry  |      #event        |
>>> |               |                      |                    |
>>> -------------------------------------------------------------
>>> |               |                      |                    |
>>> |W/ the patches |         6            |       22740        |
>>> |               |                      |                    |
>>> -------------------------------------------------------------
>>> |               |                      |                    |
>>> |W/O the patches|        128           |       46481        |
>>> |               |                      |                    |
>>> -------------------------------------------------------------
>>
>> Any chance you could trace how long the list traversal took?  It would
>> be good for future reference to have an idea what kinds of timescales
>> we're talking about.
> 
> Hi.
> 
> I made a simple test to get the time consumed by the list traversal.
> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
> All guest vcpu are pinned to one pcpu. collect data by
> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
> xentrace_format. When the list length is about 128, the traversal time
> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
> 2900 vcpus can be added into the list.

Great, thanks Chao Gao, that's useful.  I'm not sure a fixed latency --
say 500us -- is the right thing to look at; if all 2900 vcpus arranged
to have interrupts staggered at 500us intervals it could easily lock up
the cpu for nearly a full second.  But I'm having trouble formulating a
good limit scenario.

In any case, 22us should be safe from a security standpoint*, and 128
should be pretty safe from a "make the common case fast" standpoint:
i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
traffic will be the least of your performance problems I should think.

 -George

* Waiting for Jan to contradict me on this one. :-)
Jan Beulich May 3, 2017, 10:21 a.m. UTC | #2
>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
> On 02/05/17 06:45, Chao Gao wrote:
>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>> On 26/04/17 01:52, Chao Gao wrote:
>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>> PI blocking list) with and without the three latter patches. Here
>>>> is the result:
>>>> -------------------------------------------------------------
>>>> |               |                      |                    |
>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>> |               |                      |                    |
>>>> -------------------------------------------------------------
>>>> |               |                      |                    |
>>>> |W/ the patches |         6            |       22740        |
>>>> |               |                      |                    |
>>>> -------------------------------------------------------------
>>>> |               |                      |                    |
>>>> |W/O the patches|        128           |       46481        |
>>>> |               |                      |                    |
>>>> -------------------------------------------------------------
>>>
>>> Any chance you could trace how long the list traversal took?  It would
>>> be good for future reference to have an idea what kinds of timescales
>>> we're talking about.
>> 
>> Hi.
>> 
>> I made a simple test to get the time consumed by the list traversal.
>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
>> All guest vcpu are pinned to one pcpu. collect data by
>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>> xentrace_format. When the list length is about 128, the traversal time
>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>> 2900 vcpus can be added into the list.
> 
> Great, thanks Chao Gao, that's useful.

Looks like Chao Gao has been dropped ...

>  I'm not sure a fixed latency --
> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
> to have interrupts staggered at 500us intervals it could easily lock up
> the cpu for nearly a full second.  But I'm having trouble formulating a
> good limit scenario.
> 
> In any case, 22us should be safe from a security standpoint*, and 128
> should be pretty safe from a "make the common case fast" standpoint:
> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
> traffic will be the least of your performance problems I should think.
> 
>  -George
> 
> * Waiting for Jan to contradict me on this one. :-)

22us would certainly be fine, if this was the worst case scenario.
I'm not sure the value measured for 128 list entries can be easily
scaled to several thousands of them, due cache and/or NUMA
effects. I continue to think that we primarily need theoretical
proof of an upper boundary on list length being enforced, rather
than any measurements or randomized balancing. And just to be
clear - if someone overloads their system, I do not see a need to
have a guaranteed maximum list traversal latency here. All I ask
for is that list traversal time scales with total vCPU count divided
by pCPU count.

Jan
Jan Beulich May 8, 2017, 8:39 a.m. UTC | #3
>>> On 08.05.17 at 18:15, <chao.gao@intel.com> wrote:
> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>> On 02/05/17 06:45, Chao Gao wrote:
>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>> is the result:
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/ the patches |         6            |       22740        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/O the patches|        128           |       46481        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>
>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>> be good for future reference to have an idea what kinds of timescales
>>>>> we're talking about.
>>>> 
>>>> Hi.
>>>> 
>>>> I made a simple test to get the time consumed by the list traversal.
>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
> 40 NIC.
>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>> xentrace_format. When the list length is about 128, the traversal time
>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>> 2900 vcpus can be added into the list.
>>> 
>>> Great, thanks Chao Gao, that's useful.
>>
>>Looks like Chao Gao has been dropped ...
>>
>>>  I'm not sure a fixed latency --
>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>> to have interrupts staggered at 500us intervals it could easily lock up
>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>> good limit scenario.
>>> 
>>> In any case, 22us should be safe from a security standpoint*, and 128
>>> should be pretty safe from a "make the common case fast" standpoint:
>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>> traffic will be the least of your performance problems I should think.
>>> 
>>>  -George
>>> 
>>> * Waiting for Jan to contradict me on this one. :-)
>>
>>22us would certainly be fine, if this was the worst case scenario.
>>I'm not sure the value measured for 128 list entries can be easily
>>scaled to several thousands of them, due cache and/or NUMA
>>effects. I continue to think that we primarily need theoretical
>>proof of an upper boundary on list length being enforced, rather
>>than any measurements or randomized balancing. And just to be
>>clear - if someone overloads their system, I do not see a need to
>>have a guaranteed maximum list traversal latency here. All I ask
>>for is that list traversal time scales with total vCPU count divided
>>by pCPU count.
> 
> Thanks, Jan & George.
> 
> I think it is more clear to me about what should I do next step.
> 
> In my understanding, we should distribute the wakeup interrupts like
> this:
> 1. By default, distribute it to the local pCPU ('local' means the vCPU
> is on the pCPU) to make the common case fast.
> 2. With the list grows to a point where we think it may consumers too
> much time to traverse the list, also distribute wakeup interrupt to local
> pCPU, ignoring admin intentionally overloads their system.
> 3. When the list length reachs the theoretic average maximum (means
> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
> to another underutilized pCPU.
> 
> But, I am confused about that If we don't care someone overload their
> system, why we need the stage #3?  If not, I have no idea to meet Jan's
> request, the list traversal time scales with total vCPU count divided by
> pCPU count. Or we will reach stage #3 before stage #2?

Things is that imo point 2 is too fuzzy to be of any use, i.e. 3 should
take effect immediately. We don't mean to ignore any admin decisions
here, it is just that if they overload their systems, the net effect of 3
may still not be good enough to provide smooth behavior. But that's
then a result of them overloading their systems in the first place. IOW,
you should try to evenly distribute vCPU-s as soon as their count on
a given pCPU exceeds the calculated average.

Jan
George Dunlap May 8, 2017, 9:13 a.m. UTC | #4
On 08/05/17 17:15, Chao Gao wrote:
> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>> On 02/05/17 06:45, Chao Gao wrote:
>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>> is the result:
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/ the patches |         6            |       22740        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>> |               |                      |                    |
>>>>>> |W/O the patches|        128           |       46481        |
>>>>>> |               |                      |                    |
>>>>>> -------------------------------------------------------------
>>>>>
>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>> be good for future reference to have an idea what kinds of timescales
>>>>> we're talking about.
>>>>
>>>> Hi.
>>>>
>>>> I made a simple test to get the time consumed by the list traversal.
>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>> xentrace_format. When the list length is about 128, the traversal time
>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>> 2900 vcpus can be added into the list.
>>>
>>> Great, thanks Chao Gao, that's useful.
>>
>> Looks like Chao Gao has been dropped ...
>>
>>>  I'm not sure a fixed latency --
>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>> to have interrupts staggered at 500us intervals it could easily lock up
>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>> good limit scenario.
>>>
>>> In any case, 22us should be safe from a security standpoint*, and 128
>>> should be pretty safe from a "make the common case fast" standpoint:
>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>> traffic will be the least of your performance problems I should think.
>>>
>>>  -George
>>>
>>> * Waiting for Jan to contradict me on this one. :-)
>>
>> 22us would certainly be fine, if this was the worst case scenario.
>> I'm not sure the value measured for 128 list entries can be easily
>> scaled to several thousands of them, due cache and/or NUMA
>> effects. I continue to think that we primarily need theoretical
>> proof of an upper boundary on list length being enforced, rather
>> than any measurements or randomized balancing. And just to be
>> clear - if someone overloads their system, I do not see a need to
>> have a guaranteed maximum list traversal latency here. All I ask
>> for is that list traversal time scales with total vCPU count divided
>> by pCPU count.
> 
> Thanks, Jan & George.
> 
> I think it is more clear to me about what should I do next step.
> 
> In my understanding, we should distribute the wakeup interrupts like
> this:
> 1. By default, distribute it to the local pCPU ('local' means the vCPU
> is on the pCPU) to make the common case fast.
> 2. With the list grows to a point where we think it may consumers too
> much time to traverse the list, also distribute wakeup interrupt to local
> pCPU, ignoring admin intentionally overloads their system.
> 3. When the list length reachs the theoretic average maximum (means
> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
> to another underutilized pCPU.

By "maximal vCPU count" do you mean, "total number of active vcpus on
the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
domans * 512 vcpus each or something)?

What about saying that the limit of vcpus for any given pcpu will be:
 (v_tot / p_tot) + K
where v_tot is the total number of vcpus on the system, p_tot is the
total number of pcpus in the system, and K is a fixed number (such as
128) such that 1) the additional time walking the list is minimal, and
2) in the common case we should never come close to reaching that number?

Then the algorithm for choosing which pcpu to have the interrupt
delivered to would be:
 1. Set p = current_pcpu
 2. if len(list(p)) < v_tot / p_tot + k, choose p
 3. Otherwise, choose another p and goto 2

The "choose another p" could be random / pseudorandom selection, or it
could be some other mechanism (rotate, look for pcpus nearby on the
topology, choose the lowest one, &c).  But as long as we check the
length before assigning it, it should satisfy Jan.

 -George
Jan Beulich May 8, 2017, 9:24 a.m. UTC | #5
(Chao Gao got lost from the recipients list again; re-adding)

>>> On 08.05.17 at 11:13, <george.dunlap@citrix.com> wrote:
> On 08/05/17 17:15, Chao Gao wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>> is the result:
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>>
>>>>> Hi.
>>>>>
>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
> 40 NIC.
>>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>> 2900 vcpus can be added into the list.
>>>>
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>> Looks like Chao Gao has been dropped ...
>>>
>>>>  I'm not sure a fixed latency --
>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>> good limit scenario.
>>>>
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems I should think.
>>>>
>>>>  -George
>>>>
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>> 22us would certainly be fine, if this was the worst case scenario.
>>> I'm not sure the value measured for 128 list entries can be easily
>>> scaled to several thousands of them, due cache and/or NUMA
>>> effects. I continue to think that we primarily need theoretical
>>> proof of an upper boundary on list length being enforced, rather
>>> than any measurements or randomized balancing. And just to be
>>> clear - if someone overloads their system, I do not see a need to
>>> have a guaranteed maximum list traversal latency here. All I ask
>>> for is that list traversal time scales with total vCPU count divided
>>> by pCPU count.
>> 
>> Thanks, Jan & George.
>> 
>> I think it is more clear to me about what should I do next step.
>> 
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>> is on the pCPU) to make the common case fast.
>> 2. With the list grows to a point where we think it may consumers too
>> much time to traverse the list, also distribute wakeup interrupt to local
>> pCPU, ignoring admin intentionally overloads their system.
>> 3. When the list length reachs the theoretic average maximum (means
>> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
>> to another underutilized pCPU.
> 
> By "maximal vCPU count" do you mean, "total number of active vcpus on
> the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
> domans * 512 vcpus each or something)?

The former.

> What about saying that the limit of vcpus for any given pcpu will be:
>  (v_tot / p_tot) + K
> where v_tot is the total number of vcpus on the system, p_tot is the
> total number of pcpus in the system, and K is a fixed number (such as
> 128) such that 1) the additional time walking the list is minimal, and
> 2) in the common case we should never come close to reaching that number?
> 
> Then the algorithm for choosing which pcpu to have the interrupt
> delivered to would be:
>  1. Set p = current_pcpu
>  2. if len(list(p)) < v_tot / p_tot + k, choose p
>  3. Otherwise, choose another p and goto 2
> 
> The "choose another p" could be random / pseudorandom selection, or it
> could be some other mechanism (rotate, look for pcpus nearby on the
> topology, choose the lowest one, &c).  But as long as we check the
> length before assigning it, it should satisfy Jan.

Right.

Jan
Chao Gao May 8, 2017, 4:15 p.m. UTC | #6
On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>> On 02/05/17 06:45, Chao Gao wrote:
>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>> PI blocking list) with and without the three latter patches. Here
>>>>> is the result:
>>>>> -------------------------------------------------------------
>>>>> |               |                      |                    |
>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>> |               |                      |                    |
>>>>> -------------------------------------------------------------
>>>>> |               |                      |                    |
>>>>> |W/ the patches |         6            |       22740        |
>>>>> |               |                      |                    |
>>>>> -------------------------------------------------------------
>>>>> |               |                      |                    |
>>>>> |W/O the patches|        128           |       46481        |
>>>>> |               |                      |                    |
>>>>> -------------------------------------------------------------
>>>>
>>>> Any chance you could trace how long the list traversal took?  It would
>>>> be good for future reference to have an idea what kinds of timescales
>>>> we're talking about.
>>> 
>>> Hi.
>>> 
>>> I made a simple test to get the time consumed by the list traversal.
>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 40 NIC.
>>> All guest vcpu are pinned to one pcpu. collect data by
>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>> xentrace_format. When the list length is about 128, the traversal time
>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>> 2900 vcpus can be added into the list.
>> 
>> Great, thanks Chao Gao, that's useful.
>
>Looks like Chao Gao has been dropped ...
>
>>  I'm not sure a fixed latency --
>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>> to have interrupts staggered at 500us intervals it could easily lock up
>> the cpu for nearly a full second.  But I'm having trouble formulating a
>> good limit scenario.
>> 
>> In any case, 22us should be safe from a security standpoint*, and 128
>> should be pretty safe from a "make the common case fast" standpoint:
>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>> traffic will be the least of your performance problems I should think.
>> 
>>  -George
>> 
>> * Waiting for Jan to contradict me on this one. :-)
>
>22us would certainly be fine, if this was the worst case scenario.
>I'm not sure the value measured for 128 list entries can be easily
>scaled to several thousands of them, due cache and/or NUMA
>effects. I continue to think that we primarily need theoretical
>proof of an upper boundary on list length being enforced, rather
>than any measurements or randomized balancing. And just to be
>clear - if someone overloads their system, I do not see a need to
>have a guaranteed maximum list traversal latency here. All I ask
>for is that list traversal time scales with total vCPU count divided
>by pCPU count.

Thanks, Jan & George.

I think it is more clear to me about what should I do next step.

In my understanding, we should distribute the wakeup interrupts like
this:
1. By default, distribute it to the local pCPU ('local' means the vCPU
is on the pCPU) to make the common case fast.
2. With the list grows to a point where we think it may consumers too
much time to traverse the list, also distribute wakeup interrupt to local
pCPU, ignoring admin intentionally overloads their system.
3. When the list length reachs the theoretic average maximum (means
maximal vCPU count divided by pCPU count), distribute wakeup interrupt
to another underutilized pCPU.

But, I am confused about that If we don't care someone overload their
system, why we need the stage #3?  If not, I have no idea to meet Jan's
request, the list traversal time scales with total vCPU count divided by
pCPU count. Or we will reach stage #3 before stage #2?

Thanks
Chao
Chao Gao May 8, 2017, 4:38 p.m. UTC | #7
On Mon, May 08, 2017 at 02:39:25AM -0600, Jan Beulich wrote:
>>>> On 08.05.17 at 18:15, <chao.gao@intel.com> wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>> is the result:
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>> |               |                      |                    |
>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>> |               |                      |                    |
>>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>> 
>>>>> Hi.
>>>>> 
>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
>> 40 NIC.
>>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>> 2900 vcpus can be added into the list.
>>>> 
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>>Looks like Chao Gao has been dropped ...
>>>
>>>>  I'm not sure a fixed latency --
>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>> good limit scenario.
>>>> 
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems I should think.
>>>> 
>>>>  -George
>>>> 
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>>22us would certainly be fine, if this was the worst case scenario.
>>>I'm not sure the value measured for 128 list entries can be easily
>>>scaled to several thousands of them, due cache and/or NUMA
>>>effects. I continue to think that we primarily need theoretical
>>>proof of an upper boundary on list length being enforced, rather
>>>than any measurements or randomized balancing. And just to be
>>>clear - if someone overloads their system, I do not see a need to
>>>have a guaranteed maximum list traversal latency here. All I ask
>>>for is that list traversal time scales with total vCPU count divided
>>>by pCPU count.
>> 
>> Thanks, Jan & George.
>> 
>> I think it is more clear to me about what should I do next step.
>> 
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>> is on the pCPU) to make the common case fast.
>> 2. With the list grows to a point where we think it may consumers too
>> much time to traverse the list, also distribute wakeup interrupt to local
>> pCPU, ignoring admin intentionally overloads their system.
>> 3. When the list length reachs the theoretic average maximum (means
>> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
>> to another underutilized pCPU.
>> 
>> But, I am confused about that If we don't care someone overload their
>> system, why we need the stage #3?  If not, I have no idea to meet Jan's
>> request, the list traversal time scales with total vCPU count divided by
>> pCPU count. Or we will reach stage #3 before stage #2?
>
>Things is that imo point 2 is too fuzzy to be of any use, i.e. 3 should
>take effect immediately. We don't mean to ignore any admin decisions
>here, it is just that if they overload their systems, the net effect of 3
>may still not be good enough to provide smooth behavior. But that's
>then a result of them overloading their systems in the first place. IOW,
>you should try to evenly distribute vCPU-s as soon as their count on
>a given pCPU exceeds the calculated average.

Very helpful and reasonable. Thank you, Jan.
Chao Gao May 8, 2017, 5:37 p.m. UTC | #8
On Mon, May 08, 2017 at 03:24:47AM -0600, Jan Beulich wrote:
>(Chao Gao got lost from the recipients list again; re-adding)
>
>>>> On 08.05.17 at 11:13, <george.dunlap@citrix.com> wrote:
>> On 08/05/17 17:15, Chao Gao wrote:
>>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>>> On 03.05.17 at 12:08, <george.dunlap@citrix.com> wrote:
>>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>>> I compared the maximum of #entry in one list and #event (adding entry to
>>>>>>>> PI blocking list) with and without the three latter patches. Here
>>>>>>>> is the result:
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |    Items      |   Maximum of #entry  |      #event        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |W/ the patches |         6            |       22740        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>> |               |                      |                    |
>>>>>>>> |W/O the patches|        128           |       46481        |
>>>>>>>> |               |                      |                    |
>>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>>> we're talking about.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>>> Apply below patch and create one hvm guest with 128 vcpus and a passthrough 
>> 40 NIC.
>>>>>> All guest vcpu are pinned to one pcpu. collect data by
>>>>>> 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode data by
>>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>>> frequence is 1795.788MHz, therefore the time consumed is in the range of 1us
>>>>>> to 22us. If 0.5ms is the upper bound the system can tolerate, at most
>>>>>> 2900 vcpus can be added into the list.
>>>>>
>>>>> Great, thanks Chao Gao, that's useful.
>>>>
>>>> Looks like Chao Gao has been dropped ...
>>>>
>>>>>  I'm not sure a fixed latency --
>>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>>>> good limit scenario.
>>>>>
>>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>>> traffic will be the least of your performance problems I should think.
>>>>>
>>>>>  -George
>>>>>
>>>>> * Waiting for Jan to contradict me on this one. :-)
>>>>
>>>> 22us would certainly be fine, if this was the worst case scenario.
>>>> I'm not sure the value measured for 128 list entries can be easily
>>>> scaled to several thousands of them, due cache and/or NUMA
>>>> effects. I continue to think that we primarily need theoretical
>>>> proof of an upper boundary on list length being enforced, rather
>>>> than any measurements or randomized balancing. And just to be
>>>> clear - if someone overloads their system, I do not see a need to
>>>> have a guaranteed maximum list traversal latency here. All I ask
>>>> for is that list traversal time scales with total vCPU count divided
>>>> by pCPU count.
>>> 
>>> Thanks, Jan & George.
>>> 
>>> I think it is more clear to me about what should I do next step.
>>> 
>>> In my understanding, we should distribute the wakeup interrupts like
>>> this:
>>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>>> is on the pCPU) to make the common case fast.
>>> 2. With the list grows to a point where we think it may consumers too
>>> much time to traverse the list, also distribute wakeup interrupt to local
>>> pCPU, ignoring admin intentionally overloads their system.
>>> 3. When the list length reachs the theoretic average maximum (means
>>> maximal vCPU count divided by pCPU count), distribute wakeup interrupt
>>> to another underutilized pCPU.
>> 
>> By "maximal vCPU count" do you mean, "total number of active vcpus on
>> the system"?  Or some other theoretical maximum vcpu count (e.g., 32k
>> domans * 512 vcpus each or something)?
>
>The former.

Ok. Actually I meant the latter. But now, I realize I was wrong.

>
>> What about saying that the limit of vcpus for any given pcpu will be:
>>  (v_tot / p_tot) + K
>> where v_tot is the total number of vcpus on the system, p_tot is the
>> total number of pcpus in the system, and K is a fixed number (such as
>> 128) such that 1) the additional time walking the list is minimal, and
>> 2) in the common case we should never come close to reaching that number?
>> 
>> Then the algorithm for choosing which pcpu to have the interrupt
>> delivered to would be:
>>  1. Set p = current_pcpu
>>  2. if len(list(p)) < v_tot / p_tot + k, choose p
>>  3. Otherwise, choose another p and goto 2
>> 
>> The "choose another p" could be random / pseudorandom selection, or it
>> could be some other mechanism (rotate, look for pcpus nearby on the
>> topology, choose the lowest one, &c).  But as long as we check the
>> length before assigning it, it should satisfy Jan.

Very clear and helpful. Othewise, I may need spending several months to
reach this solution. Thanks, George. :)
diff mbox

Patch

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index 8b31780..34ed9e4 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -125,6 +125,9 @@ 
 0x00082020  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  INTR_WINDOW [ value = 0x%(1)08x ]
 0x00082021  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  NPF         [ gpa = 0x%(2)08x%(1)08x mfn = 0x%(4)08x%(3)08x qual = 0x%(5)04x p2mt = 0x%(6)04x ]
 0x00082023  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  TRAP        [ vector = 0x%(1)02x ]
+0x00082026  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_BLOCK_LIST  [ domid = 0x%(1)04x vcpu = 0x%(2)04x, pcpu = 0x%(3)04x, #entry = 0x%(4)04x ]
+0x00082027  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_WAKEUP_START [ list_len = 0x%(1)04x ]
+0x00082028  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  PI_WAKEUP_END
 
 0x0010f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_map      [ domid = %(1)d ]
 0x0010f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  page_grant_unmap    [ domid = %(1)d ]
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c8ef18a..3a6640b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,6 +82,7 @@  static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 struct vmx_pi_blocking_vcpu {
     struct list_head     list;
     spinlock_t           lock;
+    atomic_t             counter;
 };
 
 /*
@@ -119,6 +120,9 @@  static void vmx_vcpu_block(struct vcpu *v)
      */
     ASSERT(old_lock == NULL);
 
+    atomic_inc(&per_cpu(vmx_pi_blocking, v->processor).counter);
+    HVMTRACE_4D(VT_D_PI_BLOCK, v->domain->domain_id, v->vcpu_id, v->processor,
+                atomic_read(&per_cpu(vmx_pi_blocking, v->processor).counter));
     list_add_tail(&v->arch.hvm_vmx.pi_blocking.list,
                   &per_cpu(vmx_pi_blocking, v->processor).list);
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
@@ -186,6 +190,8 @@  static void vmx_pi_unblock_vcpu(struct vcpu *v)
     {
         ASSERT(v->arch.hvm_vmx.pi_blocking.lock == pi_blocking_list_lock);
         list_del(&v->arch.hvm_vmx.pi_blocking.list);
+        atomic_dec(&container_of(pi_blocking_list_lock,
+                                 struct vmx_pi_blocking_vcpu, lock)->counter);
         v->arch.hvm_vmx.pi_blocking.lock = NULL;
     }
 
@@ -234,6 +240,7 @@  void vmx_pi_desc_fixup(unsigned int cpu)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, cpu).counter);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
@@ -2360,13 +2367,15 @@  static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     struct arch_vmx_struct *vmx, *tmp;
     spinlock_t *lock = &per_cpu(vmx_pi_blocking, smp_processor_id()).lock;
     struct list_head *blocked_vcpus =
-		&per_cpu(vmx_pi_blocking, smp_processor_id()).list;
+               &per_cpu(vmx_pi_blocking, smp_processor_id()).list;
 
     ack_APIC_irq();
     this_cpu(irq_count)++;
 
     spin_lock(lock);
 
+    TRACE_1D(TRC_HVM_PI_WAKEUP_START,
+        atomic_read(&per_cpu(vmx_pi_blocking, smp_processor_id()).counter));
     /*
      * XXX: The length of the list depends on how many vCPU is current
      * blocked on this specific pCPU. This may hurt the interrupt latency
@@ -2377,11 +2386,13 @@  static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
         if ( pi_test_on(&vmx->pi_desc) )
         {
             list_del(&vmx->pi_blocking.list);
+            atomic_dec(&per_cpu(vmx_pi_blocking, smp_processor_id()).counter);
             ASSERT(vmx->pi_blocking.lock == lock);
             vmx->pi_blocking.lock = NULL;
             vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
         }
     }
+    TRACE_0D(TRC_HVM_PI_WAKEUP_END);
 
     spin_unlock(lock);
 }
diff --git a/xen/include/asm-x86/hvm/trace.h b/xen/include/asm-x86/hvm/trace.h
index de802a6..afe8b75 100644
--- a/xen/include/asm-x86/hvm/trace.h
+++ b/xen/include/asm-x86/hvm/trace.h
@@ -54,6 +54,9 @@ 
 #define DO_TRC_HVM_TRAP             DEFAULT_HVM_MISC
 #define DO_TRC_HVM_TRAP_DEBUG       DEFAULT_HVM_MISC
 #define DO_TRC_HVM_VLAPIC           DEFAULT_HVM_MISC
+#define DO_TRC_HVM_VT_D_PI_BLOCK    DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_WAKEUP_START  DEFAULT_HVM_MISC
+#define DO_TRC_HVM_PI_WAKEUP_END    DEFAULT_HVM_MISC
 
 
 #define TRC_PAR_LONG(par) ((par)&0xFFFFFFFF),((par)>>32)
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index 7f2e891..c5b95ee 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -234,6 +234,9 @@ 
 #define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
 #define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
 #define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_VT_D_PI_BLOCK    (TRC_HVM_HANDLER + 0x26)
+#define TRC_HVM_PI_WAKEUP_START  (TRC_HVM_HANDLER + 0x27)
+#define TRC_HVM_PI_WAKEUP_END    (TRC_HVM_HANDLER + 0x28)
 
 #define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
 #define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)