[RFC,0/4] arm64:kvm: teach guest sched that VCPUs can be preempted

Message ID	20200721041742.197354-1-sergey.senozhatsky@gmail.com (mailing list archive)
From: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
To: Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>, Julien Thierry <julien.thierry.kdev@gmail.com>, Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: joelaf@google.com, linux-kernel@vger.kernel.org, Sergey Senozhatsky <sergey.senozhatsky@gmail.com>, suleiman@google.com, kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org
Subject: [RFC][PATCH 0/4] arm64:kvm: teach guest sched that VCPUs can be preempted
Date: Tue, 21 Jul 2020 13:17:38 +0900
Series	arm64:kvm: teach guest sched that VCPUs can be preempted

Sergey Senozhatsky July 21, 2020, 4:17 a.m. UTC

Hello,

	RFC

	We noticed that in a number of cases when we wake_up_process()
on an arm64 guest we end up enqueuing that task on a preempted VCPU. The
culprit appears to be that arm64 guests are not aware of VCPU preemption
at all, so when the scheduler picks an idle VCPU it always assumes that
VCPU is available:

      wake_up_process()
       try_to_wake_up()
        select_task_rq_fair()
         available_idle_cpu()
          vcpu_is_preempted()    // return false;

Which is, obviously, not the case.

This RFC patch set adds a simple vcpu_is_preempted() implementation so
that the scheduler can make better decisions when it searches for an idle
(v)CPU.
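
The call chain above can be illustrated with a small user-space model. This is a sketch only, not code from the patch set: the `pv_vcpu_state` structure, the `NR_VCPUS` constant and the `pick_idle_vcpu()` helper are hypothetical names, and a real guest would read the state from memory shared with the host rather than from a local array.

```c
#include <stdbool.h>

#define NR_VCPUS 4	/* hypothetical guest size */

/* Hypothetical per-vCPU record; in a real guest the host would update it. */
struct pv_vcpu_state {
	bool idle;	/* guest view: nothing runnable on this vCPU */
	bool preempted;	/* host view: the vCPU thread is scheduled out */
};

static struct pv_vcpu_state vcpu_state[NR_VCPUS];

/* Dummy variant: the guest has no preemption information at all. */
static bool vcpu_is_preempted_dummy(int cpu)
{
	(void)cpu;
	return false;
}

/* Paravirt variant: consult the state published by the host. */
static bool vcpu_is_preempted_pv(int cpu)
{
	return vcpu_state[cpu].preempted;
}

/* Models available_idle_cpu(): idle in the guest and not preempted. */
static bool available_idle_cpu(int cpu, bool (*is_preempted)(int))
{
	return vcpu_state[cpu].idle && !is_preempted(cpu);
}

/* Hypothetical wakeup helper: first available idle vCPU, or -1 if none. */
static int pick_idle_vcpu(bool (*is_preempted)(int))
{
	for (int cpu = 0; cpu < NR_VCPUS; cpu++)
		if (available_idle_cpu(cpu, is_preempted))
			return cpu;
	return -1;
}
```

With vCPU 0 idle but preempted and vCPU 2 idle and running, the dummy check would still select vCPU 0, while the paravirt check would skip it and select vCPU 2.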

I ran a number of sched benchmarks; please refer to [0] for more
details.

[0] https://github.com/sergey-senozhatsky/arm64-vcpu_is_preempted

Sergey Senozhatsky (4):
  arm64:kvm: define pv_state SMCCC HV calls
  arm64: add guest pvstate support
  arm64: add host pvstate support
  arm64: do not use dummy vcpu_is_preempted() anymore

 arch/arm64/include/asm/kvm_host.h  |  23 ++++++
 arch/arm64/include/asm/paravirt.h  |  15 ++++
 arch/arm64/include/asm/spinlock.h  |  17 +++--
 arch/arm64/kernel/Makefile         |   2 +-
 arch/arm64/kernel/paravirt-state.c | 117 +++++++++++++++++++++++++++++
 arch/arm64/kernel/paravirt.c       |   4 +-
 arch/arm64/kernel/time.c           |   1 +
 arch/arm64/kvm/Makefile            |   2 +-
 arch/arm64/kvm/arm.c               |   4 +
 arch/arm64/kvm/hypercalls.c        |  11 +++
 arch/arm64/kvm/pvstate.c           |  58 ++++++++++++++
 include/linux/arm-smccc.h          |  18 +++++
 12 files changed, 262 insertions(+), 10 deletions(-)
 create mode 100644 arch/arm64/kernel/paravirt-state.c
 create mode 100644 arch/arm64/kvm/pvstate.c

Sergey Senozhatsky Aug. 17, 2020, 2:03 a.m. UTC | #1

On (20/07/21 13:17), Sergey Senozhatsky wrote:
> Hello,
> 
> 	RFC
> 
> 	We noticed that in a number of cases when we wake_up_process()
> on arm64 guest we end up enqueuing that task on a preempted VCPU. The culprit
> appears to be the fact that arm64 guests are not aware of VCPU preemption
> as such, so when sched picks up an idle VCPU it always assumes that VCPU
> is available:
> 
>       wake_up_process()
>        try_to_wake_up()
>         select_task_rq_fair()
>          available_idle_cpu()
>           vcpu_is_preempted()    // return false;
> 
> Which is, obviously, not the case.
> 
> This RFC patch set adds a simple vcpu_is_preempted() implementation so
> that scheduler can make better decisions when it search for the idle
> (v)CPU.

Hi,

A gentle ping.

	-ss

yezengruan Aug. 17, 2020, 12:03 p.m. UTC | #2

On 2020/8/17 10:03, Sergey Senozhatsky wrote:
> On (20/07/21 13:17), Sergey Senozhatsky wrote:
>> Hello,
>>
>> 	RFC
>>
>> 	We noticed that in a number of cases when we wake_up_process()
>> on arm64 guest we end up enqueuing that task on a preempted VCPU. The culprit
>> appears to be the fact that arm64 guests are not aware of VCPU preemption
>> as such, so when sched picks up an idle VCPU it always assumes that VCPU
>> is available:
>>
>>       wake_up_process()
>>        try_to_wake_up()
>>         select_task_rq_fair()
>>          available_idle_cpu()
>>           vcpu_is_preempted()    // return false;
>>
>> Which is, obviously, not the case.
>>
>> This RFC patch set adds a simple vcpu_is_preempted() implementation so
>> that scheduler can make better decisions when it search for the idle
>> (v)CPU.
> Hi,
>
> A gentle ping.
>
> 	-ss

Hi Sergey,

I have a set of patches similar to yours.

https://lore.kernel.org/lkml/20191226135833.1052-1-yezengruan@huawei.com/

Marc Zyngier Aug. 17, 2020, 12:25 p.m. UTC | #3

On 2020-08-17 13:03, yezengruan wrote:
> On 2020/8/17 10:03, Sergey Senozhatsky wrote:
>> On (20/07/21 13:17), Sergey Senozhatsky wrote:
>>> Hello,
>>> 
>>> 	RFC
>>> 
>>> 	We noticed that in a number of cases when we wake_up_process()
>>> on arm64 guest we end up enqueuing that task on a preempted VCPU. The 
>>> culprit
>>> appears to be the fact that arm64 guests are not aware of VCPU 
>>> preemption
>>> as such, so when sched picks up an idle VCPU it always assumes that 
>>> VCPU
>>> is available:
>>> 
>>>       wake_up_process()
>>>        try_to_wake_up()
>>>         select_task_rq_fair()
>>>          available_idle_cpu()
>>>           vcpu_is_preempted()    // return false;
>>> 
>>> Which is, obviously, not the case.
>>> 
>>> This RFC patch set adds a simple vcpu_is_preempted() implementation 
>>> so
>>> that scheduler can make better decisions when it search for the idle
>>> (v)CPU.
>> Hi,
>> 
>> A gentle ping.
>> 
>> 	-ss
> 
> Hi Sergey,
> 
> I have a set of patches similar to yours.
> 
> https://lore.kernel.org/lkml/20191226135833.1052-1-yezengruan@huawei.com/

It really isn't the same thing at all. You are exposing PV spinlocks,
while Sergey exposes preemption to vcpus. The former is a massive,
and probably unnecessary, superset of the latter, which only impacts
the scheduler (it doesn't change the way locks are implemented).

You really shouldn't conflate the two (which you have done in your
series).

         M.

yezengruan Aug. 17, 2020, 2:15 p.m. UTC | #4

On 2020/8/17 20:25, Marc Zyngier wrote:
> On 2020-08-17 13:03, yezengruan wrote:
>> On 2020/8/17 10:03, Sergey Senozhatsky wrote:
>>> On (20/07/21 13:17), Sergey Senozhatsky wrote:
>>>> Hello,
>>>>
>>>>     RFC
>>>>
>>>>     We noticed that in a number of cases when we wake_up_process()
>>>> on arm64 guest we end up enqueuing that task on a preempted VCPU. The culprit
>>>> appears to be the fact that arm64 guests are not aware of VCPU preemption
>>>> as such, so when sched picks up an idle VCPU it always assumes that VCPU
>>>> is available:
>>>>
>>>>       wake_up_process()
>>>>        try_to_wake_up()
>>>>         select_task_rq_fair()
>>>>          available_idle_cpu()
>>>>           vcpu_is_preempted()    // return false;
>>>>
>>>> Which is, obviously, not the case.
>>>>
>>>> This RFC patch set adds a simple vcpu_is_preempted() implementation so
>>>> that scheduler can make better decisions when it search for the idle
>>>> (v)CPU.
>>> Hi,
>>>
>>> A gentle ping.
>>>
>>>     -ss
>>
>> Hi Sergey,
>>
>> I have a set of patches similar to yours.
>>
>> https://lore.kernel.org/lkml/20191226135833.1052-1-yezengruan@huawei.com/
>
> It really isn't the same thing at all. You are exposing PV spinlocks,
> while Sergey exposes preemption to vcpus. The former is a massive,
> and probably unnecessary superset of the later, which only impacts
> the scheduler (it doesn't change the way locks are implemented).
>
> You really shouldn't conflate the two (which you have done in your
> series).
>
>         M.


Hi Marc,

Actually, both series support a paravirtualized vcpu_is_preempted. My
series regards this as a PV lock feature, but only the vcpu_is_preempted
interface of pv_lock_opt is implemented.

Besides wake_up_process(), the vcpu_is_preempted interface is used by the
current kernel in the following scenarios:

kernel/sched/core.c:                            <---- wake_up_process()
--------------------
available_idle_cpu
    vcpu_is_preempted

kernel/locking/rwsem.c:
-----------------------
rwsem_optimistic_spin
    rwsem_spin_on_owner
        owner_on_cpu
            vcpu_is_preempted

kernel/locking/mutex.c:
-----------------------
mutex_optimistic_spin
    mutex_spin_on_owner
        vcpu_is_preempted

kernel/locking/osq_lock.c:
--------------------------
osq_lock
    vcpu_is_preempted
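
The locking call sites listed above all use the check in the same way: a waiter keeps spinning only while the lock owner can make progress. A simplified, single-threaded model of that decision is sketched below. `struct task`, `spin_on_owner()` and the `max_spins` bound are hypothetical and only stand in for the kernel's `owner_on_cpu()` and `mutex_spin_on_owner()` logic; the preemption flags are a plain array rather than host-shared memory.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a lock owner task. */
struct task {
	int cpu;	/* vCPU the owner was last running on */
	bool on_cpu;	/* guest scheduler: owner is currently running */
};

/* Hypothetical host-provided flags, one per vCPU. */
static bool vcpu_preempted[4];

static bool vcpu_is_preempted(int cpu)
{
	return vcpu_preempted[cpu];
}

/* Modelled on owner_on_cpu(): running in the guest AND on a physical CPU. */
static bool owner_on_cpu(const struct task *owner)
{
	return owner->on_cpu && !vcpu_is_preempted(owner->cpu);
}

/*
 * Returns true if spinning is still worthwhile after max_spins iterations,
 * false if the waiter should stop spinning and sleep instead.
 */
static bool spin_on_owner(const struct task *owner, int max_spins)
{
	for (int i = 0; i < max_spins; i++) {
		if (!owner_on_cpu(owner))
			return false;	/* owner cannot release the lock soon */
	}
	return true;
}
```

Without preemption information the waiter would keep burning cycles on a lock whose owner is not running at all, which is why these paths are affected in addition to the scheduler path.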


Thanks,

Zengruan

Sergey Senozhatsky Sept. 11, 2020, 8:46 a.m. UTC | #5

Hi,

On (20/08/17 20:03), yezengruan wrote:
> Hi Sergey,
> 
> I have a set of patches similar to yours.
> 
> https://lore.kernel.org/lkml/20191226135833.1052-1-yezengruan@huawei.com/

I'm sorry for the belated reply.

Right, quite similar, but not exactly, I believe. I deliberately wanted
to untangle vcpu preemption (which is a characteristic feature) from
pv-lock, which may be somewhat implementation dependent.

Perhaps vcpu_is_preempted() should not even be implemented on a per-arch
basis, but instead it can be more of a "core" functionality.

	-ss

Sergey Senozhatsky Sept. 11, 2020, 8:58 a.m. UTC | #6

My apologies for the slow reply.

On (20/08/17 13:25), Marc Zyngier wrote:
>
> It really isn't the same thing at all. You are exposing PV spinlocks,
> while Sergey exposes preemption to vcpus.
>

Correct, we see vcpu preemption as a "fundamental" feature, with
consequences that affect scheduling, which is a core feature :)

Marc, is there anything in particular that you dislike about this RFC
patch set? Joel has some ideas, which we may discuss offline if that
works for you.

	-ss

Joel Fernandes Dec. 8, 2020, 8:02 p.m. UTC | #7

On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
<sergey.senozhatsky@gmail.com> wrote:
>
> My apologies for the slow reply.
>
> On (20/08/17 13:25), Marc Zyngier wrote:
> >
> > It really isn't the same thing at all. You are exposing PV spinlocks,
> > while Sergey exposes preemption to vcpus.
> >
>
> Correct, we see vcpu preemption as a "fundamental" feature, with
> consequences that affect scheduling, which is a core feature :)
>
> Marc, is there anything in particular that you dislike about this RFC
> patch set? Joel has some ideas, which we may discuss offline if that
> works for you.

Hi Marc, Sergey, Just checking what is the latest on this series?

About the idea Sergey and I discussed, at a high level we discussed
being able to share information similar to "Is the vCPU preempted?"
using a more arch-independent infrastructure. I do not believe this
needs to be arch-specific. Maybe the specific mechanism about how to
share a page of information needs to be arch-specific, but the actual
information shared need not be. This could open the door to sharing
more such information in an arch-independent way (for example, if the
scheduler needs to know other information such as the capacity of the
CPU that the vCPU is on).

Other thoughts?

thanks,

 - Joel

Marc Zyngier Dec. 9, 2020, 9:43 a.m. UTC | #8

Hi all,

On 2020-12-08 20:02, Joel Fernandes wrote:
> On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
> <sergey.senozhatsky@gmail.com> wrote:
>> 
>> My apologies for the slow reply.
>> 
>> On (20/08/17 13:25), Marc Zyngier wrote:
>> >
>> > It really isn't the same thing at all. You are exposing PV spinlocks,
>> > while Sergey exposes preemption to vcpus.
>> >
>> 
>> Correct, we see vcpu preemption as a "fundamental" feature, with
>> consequences that affect scheduling, which is a core feature :)
>> 
>> Marc, is there anything in particular that you dislike about this RFC
>> patch set? Joel has some ideas, which we may discuss offline if that
>> works for you.
> 
> Hi Marc, Sergey, Just checking what is the latest on this series?

I was planning to give it a go, but obviously got sidetracked. :-(

> 
> About the idea me and Sergey discussed, at a high level we discussed
> being able to share information similar to "Is the vCPU preempted?"
> using a more arch-independent infrastructure. I do not believe this
> needs to be arch-specific. Maybe the speciifc mechanism about how to
> share a page of information needs to be arch-specific, but the actual
> information shared need not be.

We already have some information sharing in the form of steal time
accounting, and I believe this "vcpu preempted" falls in the same
bucket. It looks like we could implement the feature as an extension
of the steal-time accounting, as the two concepts are linked
(one describes the accumulation of non-running time, the other is
instantaneous).
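
To make the link between the two concepts concrete, here is a hedged sketch of a shared record that carries both. The layout and all names (`pv_time_rec`, `host_sched_out()`, `host_sched_in()`, `guest_vcpu_is_preempted()`) are invented for illustration and are not the existing arm64 stolen-time ABI. The model also simplifies by treating every scheduled-out interval as stolen time, and it takes timestamps as parameters so that it stays deterministic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared record: accumulated steal time plus instantaneous flag. */
struct pv_time_rec {
	uint64_t stolen_ns;	/* accumulated time the vCPU was scheduled out */
	bool preempted;		/* true while the vCPU is scheduled out */
	uint64_t out_since_ns;	/* when the vCPU was last scheduled out */
};

/* Host side: the vCPU thread is scheduled out at time now_ns. */
static void host_sched_out(struct pv_time_rec *rec, uint64_t now_ns)
{
	rec->preempted = true;
	rec->out_since_ns = now_ns;
}

/* Host side: the vCPU thread is scheduled back in at time now_ns. */
static void host_sched_in(struct pv_time_rec *rec, uint64_t now_ns)
{
	if (rec->preempted)
		rec->stolen_ns += now_ns - rec->out_since_ns;
	rec->preempted = false;
}

/* Guest side: the instantaneous question the scheduler asks. */
static bool guest_vcpu_is_preempted(const struct pv_time_rec *rec)
{
	return rec->preempted;
}
```

Both fields are updated at the same two host events, which is the sense in which the flag could be an extension of the steal-time record rather than a separate mechanism.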

> This could open the door to sharing
> more such information in an arch-independent way (for example, if the
> scheduler needs to know other information such as the capacity of the
> CPU that the vCPU is on).

Quentin and I have discussed potential ways of improving guest 
scheduling
on terminally broken systems (otherwise known as big-little), in the
form of a capacity request from the guest to the host. I'm not really
keen on the host exposing its own capacity, as that doesn't tell the
host what the guest actually needs.

Thanks,

         M.

Joel Fernandes Dec. 10, 2020, 1:39 a.m. UTC | #9

Hi Marc, nice to hear from you.

On Wed, Dec 9, 2020 at 4:43 AM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi all,
>
> On 2020-12-08 20:02, Joel Fernandes wrote:
> > On Fri, Sep 11, 2020 at 4:58 AM Sergey Senozhatsky
> > <sergey.senozhatsky@gmail.com> wrote:
> >>
> >> My apologies for the slow reply.
> >>
> >> On (20/08/17 13:25), Marc Zyngier wrote:
> >> >
> >> > It really isn't the same thing at all. You are exposing PV spinlocks,
> >> > while Sergey exposes preemption to vcpus.
> >> >
> >>
> >> Correct, we see vcpu preemption as a "fundamental" feature, with
> >> consequences that affect scheduling, which is a core feature :)
> >>
> >> Marc, is there anything in particular that you dislike about this RFC
> >> patch set? Joel has some ideas, which we may discuss offline if that
> >> works for you.
> >
> > Hi Marc, Sergey, Just checking what is the latest on this series?
>
> I was planning to give it a go, but obviously got sidetracked. :-(

Ah, that happens.

> > About the idea me and Sergey discussed, at a high level we discussed
> > being able to share information similar to "Is the vCPU preempted?"
> > using a more arch-independent infrastructure. I do not believe this
> > needs to be arch-specific. Maybe the speciifc mechanism about how to
> > share a page of information needs to be arch-specific, but the actual
> > information shared need not be.
>
> We already have some information sharing in the form of steal time
> accounting, and I believe this "vcpu preempted" falls in the same
> bucket. It looks like we could implement the feature as an extension
> of the steal-time accounting, as the two concepts are linked
> (one describes the accumulation of non-running time, the other is
> instantaneous).

Yeah I noticed the steal stuff. Will go look more into that.

> > This could open the door to sharing
> > more such information in an arch-independent way (for example, if the
> > scheduler needs to know other information such as the capacity of the
> > CPU that the vCPU is on).
>
> Quentin and I have discussed potential ways of improving guest
> scheduling
> on terminally broken systems (otherwise known as big-little), in the
> form of a capacity request from the guest to the host. I'm not really
> keen on the host exposing its own capacity, as that doesn't tell the
> host what the guest actually needs.

I am not sure how a capacity request could work well. It seems the
cost of a repeated hypercall could be prohibitive. In this case, a
lighter approach might be for KVM to restrict vCPU threads to run on
certain types of cores, and pass the capacity information to the guest
at guest's boot time. This would be a one-time cost to pay. And then
the guest scheduler can handle the scheduling appropriately
without any more hypercalls. Thoughts?

- Joel

Marc Zyngier Dec. 10, 2020, 8:45 a.m. UTC | #10

On 2020-12-10 01:39, Joel Fernandes wrote:

[...]

>> Quentin and I have discussed potential ways of improving guest
>> scheduling
>> on terminally broken systems (otherwise known as big-little), in the
>> form of a capacity request from the guest to the host. I'm not really
>> keen on the host exposing its own capacity, as that doesn't tell the
>> host what the guest actually needs.
> 
> I am not sure how a capacity request could work well. It seems the
> cost of a repeated hypercall could be prohibitive. In this case, a
> lighter approach might be for KVM to restrict vCPU threads to run on
> certain types of cores, and pass the capacity information to the guest
> at guest's boot time.

That seems like a very narrow use case. If you actually pin vcpus to
physical CPU classes, DT is the right place to put things, because
it is completely static. This is effectively creating a virtual
big-little, which is in my opinion a userspace job.

> This would be a one-time cost to pay. And then,
> then the guest scheduler can handle the scheduling appropriately
> without any more hypercalls. Thoughts?

Anything that is a one-off belongs to firmware configuration, IMO.

The case I'm concerned with is when vcpus are allowed to roam across
the system, and hit random physical CPUs because the host has no idea
of the workload the guest deals with (especially as the AMU counters
are either absent or unusable on any available core).

The cost of a hypercall really depends on where you terminate it.
If it is a shallow exit, that's only a few hundred cycles on any half
baked CPU. Go all the way to userspace, and the host scheduler is the
limit. But the frequency of that hypercall obviously matters too.

How often do you expect the capacity request to fire? Probably not
on each and every time slice, right?

Quentin, can you shed some light on this?

Thanks,

         M.

Quentin Perret Dec. 11, 2020, 9:34 a.m. UTC | #11

On Thursday 10 Dec 2020 at 08:45:22 (+0000), Marc Zyngier wrote:
> On 2020-12-10 01:39, Joel Fernandes wrote:
> 
> [...]
> 
> > > Quentin and I have discussed potential ways of improving guest
> > > scheduling
> > > on terminally broken systems (otherwise known as big-little), in the
> > > form of a capacity request from the guest to the host. I'm not really
> > > keen on the host exposing its own capacity, as that doesn't tell the
> > > host what the guest actually needs.
> > 
> > I am not sure how a capacity request could work well. It seems the
> > cost of a repeated hypercall could be prohibitive. In this case, a
> > lighter approach might be for KVM to restrict vCPU threads to run on
> > certain types of cores, and pass the capacity information to the guest
> > at guest's boot time.
> 
> That seems like a very narrow use case. If you actually pin vcpus to
> physical CPU classes, DT is the right place to put things, because
> it is completely static. This is effectively creating a virtual
> big-little, which is in my opinion a userspace job.

+1, all you should need for this is to have the VMM pin the vCPUS and
set capacity-dmips-mhz in the guest DT accordingly. And if you're
worried about sharing the runqueue with host tasks, could you vacate the
host CPUs using cpusets or such?

The last difficult bit is how to drive DVFS. I suppose Marc's suggestion
to relay capacity requests from the guest would help with that.

> > This would be a one-time cost to pay. And then,
> > then the guest scheduler can handle the scheduling appropriately
> > without any more hypercalls. Thoughts?
> 
> Anything that is a one-off belongs to firmware configuration, IMO.
> 
> The case I'm concerned with is when vcpus are allowed to roam across
> the system, and hit random physical CPUs because the host has no idea
> of the workload the guest deals with (specially as the AMU counters
> are either absent or unusable on any available core).
> 
> The cost of a hypercall really depends on where you terminate it.
> If it is a shallow exit, that's only a few hundred cycles on any half
> baked CPU. Go all the way to userspace, and the host scheduler is the
> limit. But the frequency of that hypercall obviously matters too.
> 
> How often do you expect the capacity request to fire? Probably not
> on each and every time slice, right?
> 
> Quentin, can you shed some light on this?

Assuming that we change the 'capacity request' (aka uclamp.min of the
vCPU) every time the guest makes a frequency request, then the answer
very much is 'it depends on the workload'. Yes there is an overhead, but
I think it is hard to say how bad that would be before we give it a go.
It's unfortunately not uncommon to have painfully slow frequency changes
on real hardware, so this may be just fine. And there may be ways we
can mitigate this too (with rate limiting and such), so all in all it is
worth a try.
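
The rate-limiting idea mentioned above could look roughly like the following guest-side sketch. Everything here is an assumption made for illustration: `struct capacity_req`, `request_capacity()` and the `nr_hypercalls` counter (which stands in for an actual hypercall) are hypothetical names, and no such interface is defined in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest-side state for relaying capacity requests to the host. */
struct capacity_req {
	uint64_t min_interval_ns;	/* rate limit between two hypercalls */
	uint64_t last_sent_ns;		/* time of the last hypercall */
	unsigned int last_capacity;	/* value last sent to the host */
	bool sent_once;			/* false until the first hypercall */
	unsigned int nr_hypercalls;	/* stands in for the real hypercall */
};

/*
 * Called on every guest frequency request. Returns true if a (simulated)
 * hypercall was issued, false if the request was suppressed.
 */
static bool request_capacity(struct capacity_req *rq, unsigned int capacity,
			     uint64_t now_ns)
{
	if (rq->sent_once) {
		if (capacity == rq->last_capacity)
			return false;	/* nothing new to tell the host */
		if (now_ns - rq->last_sent_ns < rq->min_interval_ns)
			return false;	/* too soon since the last hypercall */
	}
	rq->nr_hypercalls++;
	rq->last_capacity = capacity;
	rq->last_sent_ns = now_ns;
	rq->sent_once = true;
	return true;
}
```

Note that this sketch simply drops suppressed requests; a real implementation would probably need to remember the latest value and send it once the interval has elapsed.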

Also as per the above, this still would help even if the VMM pins vCPUs
and such, so these two things can live and complement each other I
think.

Now, for the patch originally under discussion here, no objection from
me in principle, it looks like a nice improvement to the stolen time
stuff and I can see how that could help some use-cases, so +1 from me.

Thanks,
Quentin

Joel Fernandes Dec. 16, 2020, 1:45 a.m. UTC | #12

Hi Marc, Quentin,

On Fri, Dec 11, 2020 at 4:34 AM Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 10 Dec 2020 at 08:45:22 (+0000), Marc Zyngier wrote:
> > On 2020-12-10 01:39, Joel Fernandes wrote:
> >
> > [...]
> >
> > > > Quentin and I have discussed potential ways of improving guest
> > > > scheduling
> > > > on terminally broken systems (otherwise known as big-little), in the
> > > > form of a capacity request from the guest to the host. I'm not really
> > > > keen on the host exposing its own capacity, as that doesn't tell the
> > > > host what the guest actually needs.
> > >
> > > I am not sure how a capacity request could work well. It seems the
> > > cost of a repeated hypercall could be prohibitive. In this case, a
> > > lighter approach might be for KVM to restrict vCPU threads to run on
> > > certain types of cores, and pass the capacity information to the guest
> > > at guest's boot time.
> >
> > That seems like a very narrow use case. If you actually pin vcpus to
> > physical CPU classes, DT is the right place to put things, because
> > it is completely static. This is effectively creating a virtual
> > big-little, which is in my opinion a userspace job.
>
> +1, all you should need for this is to have the VMM pin the vCPUS and
> set capacity-dmips-mhz in the guest DT accordingly. And if you're
> worried about sharing the runqueue with host tasks, could you vacate the
> host CPUs using cpusets or such?

I agree, the VMM is the right place for it with appropriate DT
settings. I think this is similar to how CPUID is emulated on Intel as
well (for example to specify SMT topology for a vCPU) -- it is done by
the VMM.

On sharing vCPUs with host tasks, that is indeed an issue because the
host does not know the priority of an app (for example, a "top app"
running in Android in a VM). The sharing with host tasks should be OK
as long as the scheduler priorities of the vCPU threads on the host
are set up correctly?

> The last difficult bit is how to drive DVFS. I suppose Marc's suggestion
> to relay capacity requests from the guest would help with that.

Yeah I misunderstood Marc.  I think for DVFS, a hypercall for capacity
request should work and be infrequent enough. IIRC, there is some rate
limiting support in cpufreq governors as well that should reduce the
rate of hypercalls if needed.

> > > This would be a one-time cost to pay. And then,
> > > then the guest scheduler can handle the scheduling appropriately
> > > without any more hypercalls. Thoughts?
> >
> > Anything that is a one-off belongs to firmware configuration, IMO.
> >
> > The case I'm concerned with is when vcpus are allowed to roam across
> > the system, and hit random physical CPUs because the host has no idea
> > of the workload the guest deals with (specially as the AMU counters
> > are either absent or unusable on any available core).

It sounds like this might be a use case for pinning the vCPU threads
appropriately (so designate a set of vCPU threads to only run on bigs
and another set to only run on LITTLEs).  The host can set up the DT to
describe this and the VM kernel's scheduler can do appropriate task
placement.  Did I miss anything?

> > The cost of a hypercall really depends on where you terminate it.
> > If it is a shallow exit, that's only a few hundred cycles on any half
> > baked CPU. Go all the way to userspace, and the host scheduler is the
> > limit. But the frequency of that hypercall obviously matters too.
> >
> > How often do you expect the capacity request to fire? Probably not
> > on each and every time slice, right?
> >
> > Quentin, can you shed some light on this?
>
> Assuming that we change the 'capacity request' (aka uclamp.min of the
> vCPU) every time the guest makes a frequency request, then the answer
> very much is 'it depends on the workload'. Yes there is an overhead, but
> I think it is hard to say how bad that would be before we give it a go.
> It's unfortunately not uncommon to have painfully slow frequency changes
> on real hardware, so this may be just fine. And there may be ways we
> can mitigate this too (with rate limiting and such), so all in all it is
> worth a try.

Agreed.

> Also as per the above, this still would help even if the VMM pins vCPUs
> and such, so these two things can live and complement each other I
> think.

Makes sense.

> Now, for the patch originally under discussion here, no objection from
> me in principle, it looks like a nice improvement to the stolen time
> stuff and I can see how that could help some use-cases, so +1 from me.

Sounds good!

thanks,

 - Joel
