
[RFC,0/1] Make vCPUs that are HLT state candidates for load balancing

Message ID: 20210526133727.42339-1-m.misono760@gmail.com

Message

Masanori Misono May 26, 2021, 1:37 p.m. UTC
Hi,

I observed performance degradation when running some parallel programs on a
VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3)
multi-core architecture. The benchmark results are shown at the bottom. An
example of libvirt XML for creating such a VM is:

```
[...]
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-model'>
    <topology sockets='1' cores='8' threads='1'/>
  </cpu>
  <qemu:commandline>
    <qemu:arg value='-cpu'/>
    <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
  </qemu:commandline>
[...]
```

I investigated the cause and found that the problem occurs as follows:

- vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2
  share LLC.
- Thread A tries to acquire a lock but fails, resulting in a sleep state
  (via futex).
- vCPU1 becomes idle because there are no runnable threads and does HLT,
  which leads to HLT VMEXIT (if idle=halt, and KVM doesn't disable HLT
  VMEXIT using KVM_CAP_X86_DISABLE_EXITS).
- KVM sets vCPU1's st->preempted to 1 in kvm_steal_time_set_preempted().
- Thread C wakes on vCPU2. vCPU2 tries to do load balancing in
  select_idle_core(). Although vCPU1 is idle, vCPU1 is not a candidate for
  load balancing because vcpu_is_preempted(vCPU1) is true and hence
  available_idle_cpu(vCPU1) is false (see the sketch after this list).
- As a result, both thread B and thread C stay in the vCPU2's runqueue, and
  vCPU1 is not utilized.
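
For reference, the guest-side check that filters out vCPU1 is quite short.
The following is a condensed paraphrase of available_idle_cpu() (the real
code is in kernel/sched/core.c), not a verbatim copy:

```
/*
 * Condensed paraphrase of available_idle_cpu() from kernel/sched/core.c.
 * An idle CPU only counts as "available" if the paravirt hook says the
 * backing vCPU is not preempted.
 */
int available_idle_cpu(int cpu)
{
	if (!idle_cpu(cpu))
		return 0;

	/*
	 * With KVM_FEATURE_STEAL_TIME, vcpu_is_preempted() reads the
	 * preempted flag in the per-CPU steal-time area, which the host
	 * sets whenever the vCPU thread is scheduled out (including,
	 * currently, after a HLT VMEXIT).
	 */
	if (vcpu_is_preempted(cpu))
		return 0;

	return 1;
}
```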

The patch changes kvm_arch_vcpu_put() so that it does not set st->preempted
to 1 when a vCPU does a HLT VMEXIT. As a result, vcpu_is_preempted(vCPU)
returns 0, and the vCPU becomes a candidate for CFS load balancing.
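
To illustrate the shape of the change (this is only a sketch, not the actual
patch; vcpu_halted_on_hlt_exit() is a hypothetical placeholder for "this vCPU
went idle via a HLT VMEXIT rather than being preempted"):

```
/* Sketch of the intended behavior in kvm_arch_vcpu_put() (arch/x86/kvm/x86.c). */
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	/* ... existing code ... */

	/*
	 * Only mark the vCPU as preempted in the guest-visible steal-time
	 * area when it was really preempted, not when it halted voluntarily.
	 * That keeps vcpu_is_preempted() false in the guest, so the halted
	 * vCPU stays a CFS load-balancing candidate.
	 */
	if (!vcpu_halted_on_hlt_exit(vcpu))	/* hypothetical helper */
		kvm_steal_time_set_preempted(vcpu);

	/* ... existing code ... */
}
```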

The following are partial benchmark results from NPB-OMP
(https://www.nas.nasa.gov/publications/npb.html), which contains several
parallel computing programs. My machine has two NUMA nodes, and each CPU has
24 cores (Intel Xeon Platinum 8160, hyper-threading disabled). I created a VM
with 48 vCPUs and pinned each vCPU to the corresponding pCPU. I also created
a virtual NUMA topology so that the guest environment is as close to the
host as possible. Values in the table are execution times in seconds (lower
is better).

| environment \ benchmark name  | lu.C   | mg.C  | bt.C  | cg.C  |
|-------------------------------+--------+-------+-------+-------|
| host (Linux v5.13-rc3)        | 50.67  | 14.67 | 54.77 | 20.08 |
| VM (sockets=48, cores=1)      | 51.37  | 14.88 | 55.99 | 20.05 |
| VM (sockets=2, cores=24)      | 170.12 | 23.86 | 75.95 | 40.15 |
|   w/ this patch               | 48.92  | 14.95 | 55.23 | 20.09 |


vcpu_is_preempted() is also used in PV spinlock implementations to mitigate
lock holder preemption problems, etc. A vCPU holding a lock does not do
HLT, so I think this patch doesn't affect them. However, the pCPU may be
running a host thread that has a higher priority than the vCPU thread, and
in that case, vcpu_is_preempted() should ideally return 0. I guess
its implementation would be a bit complicated, so I wonder if this patch's
approach is acceptable.

Thanks,

Masanori Misono (1):
  KVM: x86: Don't set preempted when vCPU does HLT VMEXIT

 arch/x86/kvm/x86.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)


base-commit: c4681547bcce777daf576925a966ffa824edd09d

Comments

Peter Zijlstra May 26, 2021, 2:43 p.m. UTC | #1
On Wed, May 26, 2021 at 10:37:26PM +0900, Masanori Misono wrote:
> vcpu_is_preempted() is also used in PV spinlock implementations to mitigate
> lock holder preemption problems,

It's used to abort optimistic spinners.

> etc. A vCPU holding a lock does not do HLT,

Optimistic spinning is actually part of mutexes and rwsems too, and in
those cases we might very well end up in idle while holding the lock.
However, in that case the task will have been scheduled out and the
optimistic spin loop will terminate due to that (see the ->on_cpu
condition).
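
Roughly, the owner-spin loop bails out like this (paraphrased from
mutex_spin_on_owner() in kernel/locking/mutex.c, not verbatim; rwsems have
an equivalent check):

```
while (__mutex_owner(lock) == owner) {
	/*
	 * Stop spinning once the owner has been scheduled out (e.g. its
	 * vCPU went idle), we need to reschedule ourselves, or the owner's
	 * vCPU has been preempted by the host.
	 */
	if (!owner->on_cpu || need_resched() ||
	    vcpu_is_preempted(task_cpu(owner)))
		break;

	cpu_relax();
}
```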

> so I think this patch doesn't affect them.

That is correct.

> However, the pCPU may be
> running a host thread that has a higher priority than the vCPU thread, and
> in that case, vcpu_is_preempted() should ideally return 0.

No, in that case vcpu_is_preempted() really should return true. There is
no saying how long the vcpu is gone for.
Peter Zijlstra May 26, 2021, 2:49 p.m. UTC | #2
On Wed, May 26, 2021 at 10:37:26PM +0900, Masanori Misono wrote:
> Hi,
> 
> I observed performance degradation when running some parallel programs on a
> VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3)
> multi-core architecture. The benchmark results are shown at the bottom. An
> example of libvirt XML for creating such a VM is:
> 
> ```
> [...]
>   <vcpu placement='static'>8</vcpu>
>   <cpu mode='host-model'>
>     <topology sockets='1' cores='8' threads='1'/>
>   </cpu>
>   <qemu:commandline>
>     <qemu:arg value='-cpu'/>
>     <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
>   </qemu:commandline>
> [...]
> ```
> 
> I investigated the cause and found that the problem occurs as follows:
> 
> - vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2
>   share LLC.
> - Thread A tries to acquire a lock but fails, resulting in a sleep state
>   (via futex.)
> - vCPU1 becomes idle because there are no runnable threads and does HLT,
>   which leads to HLT VMEXIT (if idle=halt, and KVM doesn't disable HLT
>   VMEXIT using KVM_CAP_X86_DISABLE_EXITS).
> - KVM sets vCPU1's st->preempted to 1 in kvm_steal_time_set_preempted().
> - Thread C wakes on vCPU2. vCPU2 tries to do load balancing in
>   select_idle_core(). Although vCPU1 is idle, vCPU1 is not a candidate for
>   load balancing because vcpu_is_preempted(vCPU1) is true and hence
>   available_idle_cpu(vCPU1) is false.
> - As a result, both thread B and thread C stay in the vCPU2's runqueue, and
>   vCPU1 is not utilized.
> 
> The patch changes kvm_arch_vcpu_put() so that it does not set st->preempted
> to 1 when a vCPU does a HLT VMEXIT. As a result, vcpu_is_preempted(vCPU)
> returns 0, and the vCPU becomes a candidate for CFS load balancing.

I'm conflicted on this; the vcpu stops running, the pcpu can go do
anything, it might start the next task. There is no saying how quickly
the vcpu task can return to running.

I'm guessing your setup doesn't actually overload the system; and when
it doesn't have the vcpu thread to run, the pcpu actually goes idle too.
But for those 1:1 cases we already have knobs to disable much of this
IIRC.

So I'm tempted to say things are working as expected and you're just not
configured right.
Sean Christopherson May 26, 2021, 4:15 p.m. UTC | #3
On Wed, May 26, 2021, Peter Zijlstra wrote:
> On Wed, May 26, 2021 at 10:37:26PM +0900, Masanori Misono wrote:
> > Hi,
> > 
> > I observed performance degradation when running some parallel programs on a
> > VM that has (1) KVM_FEATURE_PV_UNHALT, (2) KVM_FEATURE_STEAL_TIME, and (3)
> > multi-core architecture. The benchmark results are shown at the bottom. An
> > example of libvirt XML for creating such a VM is:
> > 
> > ```
> > [...]
> >   <vcpu placement='static'>8</vcpu>
> >   <cpu mode='host-model'>
> >     <topology sockets='1' cores='8' threads='1'/>
> >   </cpu>
> >   <qemu:commandline>
> >     <qemu:arg value='-cpu'/>
> >     <qemu:arg value='host,l3-cache=on,+kvm-pv-unhalt,+kvm-steal-time'/>
> >   </qemu:commandline>
> > [...]
> > ```
> > 
> > I investigated the cause and found that the problem occurs as follows:
> > 
> > - vCPU1 schedules thread A, and vCPU2 schedules thread B. vCPU1 and vCPU2
> >   share LLC.
> > - Thread A tries to acquire a lock but fails, resulting in a sleep state
> >   (via futex.)
> > - vCPU1 becomes idle because there are no runnable threads and does HLT,
> >   which leads to HLT VMEXIT (if idle=halt, and KVM doesn't disable HLT
> >   VMEXIT using KVM_CAP_X86_DISABLE_EXITS).
> > - KVM sets vCPU1's st->preempted to 1 in kvm_steal_time_set_preempted().
> > - Thread C wakes on vCPU2. vCPU2 tries to do load balancing in
> >   select_idle_core(). Although vCPU1 is idle, vCPU1 is not a candidate for
> >   load balancing because vcpu_is_preempted(vCPU1) is true and hence
> >   available_idle_cpu(vCPU1) is false.
> > - As a result, both thread B and thread C stay in the vCPU2's runqueue, and
> >   vCPU1 is not utilized.

If a patch ever gets merged, please put this analysis (or at least a summary of
the problem) in the changelog.  From the patch itself, I thought "and the vCPU
becomes a candidate for CFS load balancing" was referring to CFS in the host,
which was obviously confusing.

> > The patch changes kvm_arch_vcpu_put() so that it does not set st->preempted
> > to 1 when a vCPU does a HLT VMEXIT. As a result, vcpu_is_preempted(vCPU)
> > returns 0, and the vCPU becomes a candidate for CFS load balancing.
> 
> I'm conflicted on this; the vcpu stops running, the pcpu can go do
> anything, it might start the next task. There is no saying how quickly
> the vcpu task can return to running.

Ya, the vCPU _is_ preempted after all.

> I'm guessing your setup doesn't actually overload the system; and when
> it doesn't have the vcpu thread to run, the pcpu actually goes idle too.
> But for those 1:1 cases we already have knobs to disable much of this
> IIRC.
> 
> So I'm tempted to say things are working as expected and you're just not
> configured right.

That does seem to be the case.  

> > I created a VM with 48 vCPUs, and each vCPU is pinned to the corresponding pCPU.

If vCPUs are pinned and you want to eke out performance, then I think the correct
answer is to ensure nothing else can run on those pCPUs, and/or configure KVM to
not intercept HLT.
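
With the libvirt setup from the cover letter, one way to do the latter is
roughly the following (untested sketch; if I recall correctly, QEMU's
-overcommit cpu-pm=on enables KVM_CAP_X86_DISABLE_EXITS, and it is only
sensible with 1:1 pinning and otherwise isolated pCPUs):

```
[...]
  <qemu:commandline>
    <qemu:arg value='-overcommit'/>
    <qemu:arg value='cpu-pm=on'/>
  </qemu:commandline>
[...]
```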