
[RFC,v2,0/5] Paravirt Scheduling (Dynamic vcpu priority management)

Message ID: 20240403140116.3002809-1-vineeth@bitbyteword.org

Message

Vineeth Remanan Pillai April 3, 2024, 2:01 p.m. UTC
Double scheduling is a concern with virtualization hosts where the host
schedules vcpus without knowing what is run by the vcpu, and the guest
schedules tasks without knowing where the vcpu is physically running.
This causes issues related to latencies, power consumption, resource
utilization etc. An ideal solution would be a cooperative scheduling
framework where the guest and host share scheduling related information
and make educated scheduling decisions to optimally handle the
workloads. As a first step, we are taking a stab at reducing latencies
for latency sensitive workloads in the guest.

The v1 RFC[1] was posted in December 2023. The main disagreement was
with the implementation: the patch was making scheduling policy
decisions in kvm, and kvm is not the right place to do that. The
suggestion was to move the policy decisions outside of kvm and let kvm
handle only the notifications needed to make those policy decisions.
This patch series is an iterative step towards implementing the feature
as a layered design where the policy could be implemented outside of
kvm as a kernel built-in, a kernel module or a bpf program.

This design mainly comprises 4 components:

- pvsched driver: Implements the scheduling policies. Registers with the
    host with a set of callbacks that the hypervisor (kvm) can use to
    notify the driver of the vcpu events it is interested in. The
    callbacks are passed the address of the shared memory so that the
    driver can read the scheduling information shared by the guest and
    also update the scheduling policies it has set.
- kvm component: Selects the pvsched driver for a guest and notifies
    the driver via callbacks for events that the driver is interested
    in. Also interfaces with the guest to retrieve the shared memory
    region used for sharing the scheduling information.
- host kernel component: Implements the APIs for:
    - the pvsched driver to register/unregister with the host kernel, and
    - the hypervisor to assign/unassign a driver for guests.
- guest component: Implements a framework for sharing the scheduling
    information with the pvsched driver through kvm.
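As a rough illustration of the driver-side interface (the names and
signatures below are hypothetical and are not taken from the series'
include/linux/pvsched.h):

    #include <linux/module.h>
    #include <linux/pid.h>

    /* Hypothetical sketch of the pvsched driver interface. */
    struct pvsched_vcpu_ops {
            /* Called when a vcpu of an attached guest is created/destroyed. */
            int  (*vcpu_register)(struct pid *vcpu_pid);
            void (*vcpu_unregister)(struct pid *vcpu_pid);

            /*
             * Event notification; @shmem is the kernel mapping of the
             * per-vcpu shared memory region negotiated with the guest.
             */
            void (*notify_event)(struct pid *vcpu_pid, int event, void *shmem);

            char name[32];
            struct module *owner;
    };

    /* Drivers register themselves; kvm later attaches one per guest by name. */
    int pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops);
    void pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops);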

There is another component that we refer to as the pvsched protocol. It
defines the details of the shared memory layout, the information shared
and the scheduling policy decisions. The protocol need not be part of
the kernel and can be defined separately based on the use case and
requirements. Both the guest and the selected pvsched driver need to
match the protocol for the feature to work. A protocol is identified by
a name and a possible versioning scheme. The guest advertises the
protocol, and the hypervisor can then assign the driver implementing
that protocol if one is registered in the host kernel.
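For example, a protocol could be identified by little more than a
name/version pair that the guest advertises and the host matches
against registered drivers; the structure below is purely illustrative
and is not defined anywhere in this series:

    #include <linux/types.h>

    /* Illustrative only: how a guest might advertise a pvsched protocol. */
    struct pvsched_protocol_id {
            char  name[48];  /* e.g. "latency-boost" (hypothetical) */
            __u16 major;     /* breaking changes to the shared memory layout */
            __u16 minor;     /* backward compatible additions */
    };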

This patch series implements only the first 3 components. The guest
side implementation and the protocol framework will come as a separate
series once we finalize the rest of the design.

This series also implements a sample bpf program and a kernel built-in
pvsched driver. They do not do any real work yet; they are just
skeletons to demonstrate the feature.

Rebased on 6.8.2.

[1]: https://lwn.net/Articles/955145/

Vineeth Pillai (Google) (5):
  pvsched: paravirt scheduling framework
  kvm: Implement the paravirt sched framework for kvm
  kvm: interface for managing pvsched driver for guest VMs
  pvsched: bpf support for pvsched
  selftests/bpf: sample implementation of a bpf pvsched driver.

 Kconfig                                       |   2 +
 arch/x86/kvm/Kconfig                          |  13 +
 arch/x86/kvm/x86.c                            |   3 +
 include/linux/kvm_host.h                      |  32 +++
 include/linux/pvsched.h                       | 102 +++++++
 include/uapi/linux/kvm.h                      |   6 +
 kernel/bpf/bpf_struct_ops_types.h             |   4 +
 kernel/sysctl.c                               |  27 ++
 .../testing/selftests/bpf/progs/bpf_pvsched.c |  37 +++
 virt/Makefile                                 |   2 +-
 virt/kvm/kvm_main.c                           | 265 ++++++++++++++++++
 virt/pvsched/Kconfig                          |  12 +
 virt/pvsched/Makefile                         |   2 +
 virt/pvsched/pvsched.c                        | 215 ++++++++++++++
 virt/pvsched/pvsched_bpf.c                    | 141 ++++++++++
 15 files changed, 862 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pvsched.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
 create mode 100644 virt/pvsched/Kconfig
 create mode 100644 virt/pvsched/Makefile
 create mode 100644 virt/pvsched/pvsched.c
 create mode 100644 virt/pvsched/pvsched_bpf.c

Comments

Vineeth Remanan Pillai April 8, 2024, 1:54 p.m. UTC | #1
Sorry I missed sched_ext folks, adding them as well.

Thanks,
Vineeth


On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google)
<vineeth@bitbyteword.org> wrote:
> [...]
Sean Christopherson May 1, 2024, 3:29 p.m. UTC | #2
On Wed, Apr 03, 2024, Vineeth Pillai (Google) wrote:
> Double scheduling is a concern with virtualization hosts where the host
> schedules vcpus without knowing whats run by the vcpu and guest schedules
> tasks without knowing where the vcpu is physically running. This causes
> issues related to latencies, power consumption, resource utilization
> etc. An ideal solution would be to have a cooperative scheduling
> framework where the guest and host shares scheduling related information
> and makes an educated scheduling decision to optimally handle the
> workloads. As a first step, we are taking a stab at reducing latencies
> for latency sensitive workloads in the guest.
> 
> v1 RFC[1] was posted in December 2023. The main disagreement was in the
> implementation where the patch was making scheduling policy decisions
> in kvm and kvm is not the right place to do it. The suggestion was to
> move the polcy decisions outside of kvm and let kvm only handle the
> notifications needed to make the policy decisions. This patch series is
> an iterative step towards implementing the feature as a layered
> design where the policy could be implemented outside of kvm as a
> kernel built-in, a kernel module or a bpf program.
> 
> This design comprises mainly of 4 components:
> 
> - pvsched driver: Implements the scheduling policies. Register with
>     host with a set of callbacks that hypervisor(kvm) can use to notify
>     vcpu events that the driver is interested in. The callback will be
>     passed in the address of shared memory so that the driver can get
>     scheduling information shared by the guest and also update the
>     scheduling policies set by the driver.
> - kvm component: Selects the pvsched driver for a guest and notifies
>     the driver via callbacks for events that the driver is interested
>     in. Also interface with the guest in retreiving the shared memory
>     region for sharing the scheduling information.
> - host kernel component: Implements the APIs for:
>     - pvsched driver for register/unregister to the host kernel, and
>     - hypervisor for assingning/unassigning driver for guests.
> - guest component: Implements a framework for sharing the scheduling
>     information with the pvsched driver through kvm.

Roughly summarizing an off-list discussion.
 
 - Discovery of schedulers should be handled outside of KVM and the kernel, e.g.
   similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.

 - "Negotiating" features/hooks should also be handled outside of the kernel,
   e.g. similar to how VirtIO devices negotiate features between host and guest.

 - Pushing PV scheduler entities to KVM should either be done through an exported
   API, e.g. if the scheduler is provided by a separate kernel module, or by a
   KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).

I think those were the main takeaways?  Vineeth and Joel, please chime in on
anything I've missed or misremembered.
 
The other reason I'm bringing this discussion back on-list is that I (very) briefly
discussed this with Paolo, and he pointed out the proposed rseq-based mechanism
that would allow userspace to request an extended time slice[*], and that if that
landed it would be easy-ish to reuse the interface for KVM's steal_time PV API.

I see that you're both on that thread, so presumably you're already aware of the
idea, but I wanted to bring it up here to make sure that we aren't trying to
design something that's more complex than is needed.

Specifically, if the guest has a generic way to request an extended time slice
(or boost its priority?), would that address your use cases?  Or rather, how close
does it get you?  E.g. the guest will have no way of requesting a larger time
slice or boosting priority when an event is _pending_ but not yet received by
the guest, but is that actually problematic in practice?

[*] https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home
Vineeth Remanan Pillai May 2, 2024, 1:42 p.m. UTC | #3
> > This design comprises mainly of 4 components:
> >
> > - pvsched driver: Implements the scheduling policies. Register with
> >     host with a set of callbacks that hypervisor(kvm) can use to notify
> >     vcpu events that the driver is interested in. The callback will be
> >     passed in the address of shared memory so that the driver can get
> >     scheduling information shared by the guest and also update the
> >     scheduling policies set by the driver.
> > - kvm component: Selects the pvsched driver for a guest and notifies
> >     the driver via callbacks for events that the driver is interested
> >     in. Also interface with the guest in retreiving the shared memory
> >     region for sharing the scheduling information.
> > - host kernel component: Implements the APIs for:
> >     - pvsched driver for register/unregister to the host kernel, and
> >     - hypervisor for assingning/unassigning driver for guests.
> > - guest component: Implements a framework for sharing the scheduling
> >     information with the pvsched driver through kvm.
>
> Roughly summarazing an off-list discussion.
>
>  - Discovery schedulers should be handled outside of KVM and the kernel, e.g.
>    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
>
>  - "Negotiating" features/hooks should also be handled outside of the kernel,
>    e.g. similar to how VirtIO devices negotiate features between host and guest.
>
>  - Pushing PV scheduler entities to KVM should either be done through an exported
>    API, e.g. if the scheduler is provided by a separate kernel module, or by a
>    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
>
> I think those were the main takeaways?  Vineeth and Joel, please chime in on
> anything I've missed or misremembered.
>
Thanks for the brief about the offlist discussion, all the points are
captured; just some minor additions. The v2 implementation moved the
scheduling policies out of kvm into a separate entity called the
pvsched driver, which could be implemented as a kernel module or a bpf
program. But the handshake between guest and host to decide on which
pvsched driver to attach was still going through kvm. So it was
suggested to move this handshake (discovery and negotiation) outside
of kvm. The idea is to have a virtual device exposed by the VMM which
would take care of the handshake. The guest driver for this device
would talk to the device to understand the pvsched details on the host
and pass the shared memory details. Once the handshake is completed,
the device is responsible for loading the pvsched driver (the bpf
program or kernel module responsible for implementing the policies).
The pvsched driver will register with the tracepoints exported by kvm
and handle the callbacks from then on. The scheduling itself is still
done by the host scheduler; the pvsched driver on the host is
responsible only for setting the policies (placement, priorities etc).

With the above approach, the only change in kvm would be the internal
tracepoints for pvsched. The host kernel will also be unchanged, and
all the complexities move to the VMM and the pvsched driver. The guest
kernel will have a new driver to talk to the virtual pvsched device,
and this driver would hook into the guest kernel for passing scheduling
information to the host (via tracepoints).

> The other reason I'm bringing this discussion back on-list is that I (very) briefly
> discussed this with Paolo, and he pointed out the proposed rseq-based mechanism
> that would allow userspace to request an extended time slice[*], and that if that
> landed it would be easy-ish to reuse the interface for KVM's steal_time PV API.
>
> I see that you're both on that thread, so presumably you're already aware of the
> idea, but I wanted to bring it up here to make sure that we aren't trying to
> design something that's more complex than is needed.
>
> Specifically, if the guest has a generic way to request an extended time slice
> (or boost its priority?), would that address your use cases?  Or rather, how close
> does it get you?  E.g. the guest will have no way of requesting a larger time
> slice or boosting priority when an event is _pending_ but not yet receiveed by
> the guest, but is that actually problematic in practice?
>
> [*] https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home
>
Thanks for bringing this up. We were also very much interested in this
feature and were planning to use the pvmem shared memory instead of
the rseq framework for guests. The motivation for the paravirt
scheduling framework was a bit broader than the latency issues, and
hence we were proposing a bit more complex design. Other than the use
case of temporarily extending the time slice of vcpus, we were also
looking at vcpu placement on physical cpus, and at educated decisions
that could be made by the guest scheduler if it has a picture of the
host cpu load etc. Having a paravirt mechanism to share scheduling
information would benefit such cases. Once we have this framework set
up, the policy implementation on guest and host could be taken care of
by other entities like BPF programs, modules or schedulers like
sched_ext.

We are working on a v3 incorporating the above ideas and will be
posting a design RFC shortly. Thanks for all the help and inputs on
this.

Thanks,
Vineeth
Vineeth Remanan Pillai June 24, 2024, 11:01 a.m. UTC | #4
> > Roughly summarazing an off-list discussion.
> >
> >  - Discovery schedulers should be handled outside of KVM and the kernel, e.g.
> >    similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest.
> >
> >  - "Negotiating" features/hooks should also be handled outside of the kernel,
> >    e.g. similar to how VirtIO devices negotiate features between host and guest.
> >
> >  - Pushing PV scheduler entities to KVM should either be done through an exported
> >    API, e.g. if the scheduler is provided by a separate kernel module, or by a
> >    KVM or VM ioctl() (especially if the desire is to have per-VM schedulers).
> >
> > I think those were the main takeaways?  Vineeth and Joel, please chime in on
> > anything I've missed or misremembered.
> >
> Thanks for the brief about the offlist discussion, all the points are
> captured, just some minor additions. v2 implementation removed the
> scheduling policies outside of kvm to a separate entity called pvsched
> driver and could be implemented as a kernel module or bpf program. But
> the handshake between guest and host to decide on what pvsched driver
> to attach was still going through kvm. So it was suggested to move
> this handshake(discovery and negotiation) outside of kvm. The idea is
> to have a virtual device exposed by the VMM which would take care of
> the handshake. Guest driver for this device would talk to the device
> to understand the pvsched details on the host and pass the shared
> memory details. Once the handshake is completed, the device is
> responsible for loading the pvsched driver(bpf program or kernel
> module responsible for implementing the policies). The pvsched driver
> will register to the trace points exported by kvm and handle the
> callbacks from then on. The scheduling will be taken care of by the
> host scheduler, pvsched driver on host is responsible only for setting
> the policies(placement, priorities etc).
>
> With the above approach, the only change in kvm would be the internal
> tracepoints for pvsched. Host kernel will also be unchanged and all
> the complexities move to the VMM and the pvsched driver. Guest kernel
> will have a new driver to talk to the virtual pvsched device and this
> driver would hook into the guest kernel for passing scheduling
> information to the host(via tracepoints).
>
Noting down the recent offlist discussion and the details of our response.

Based on the previous discussions, we had come up with a modified
design focusing on minimal kvm changes. The design is as follows:
- Guest and host share scheduling information via a shared memory
region. The details of the memory region layout, the information
shared, and the actions and policies are defined by the pvsched
protocol, and this protocol is implemented by a BPF program or a
kernel module.
- Host exposes a virtual device (the pvsched device) to the guest.
This device is the mechanism for host and guest to handshake and
negotiate in order to reach a decision on the pvsched protocol to use.
The virtual device is implemented in the VMM in userland as it is not
in the performance critical path.
- Guest loads a pvsched driver during device enumeration. The driver
initiates the protocol handshake and negotiation with the host and
decides on the protocol. This driver creates a per-cpu shared memory
region and shares the GFN with the device in the host. The guest also
loads the BPF program that implements the protocol in the guest.
- Once the VMM has all the information needed (per-cpu shared memory
GFN, vcpu task pids etc), it loads the BPF program which implements
the protocol on the host.
- The BPF program on the host attaches to the tracepoints in kvm to
get callbacks on events of interest like VMENTER, VMEXIT, interrupt
injection etc. Similarly, the guest BPF program attaches to
tracepoints in the guest kernel for events of interest like sched
wakeup, sched switch, enqueue, dequeue, irq entry/exit etc.

The above design is minimally invasive to kvm and the core kernel and
implements the protocol as loadable programs, with the protocol
handshake and negotiation going through the virtual device framework.
The protocol implementation takes care of information sharing and
policy enforcement, and the scheduler handles the actual scheduling
decisions. A sample policy implementation (boosting for latency
sensitive workloads, as an example) could be included in the kernel
for reference.
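As a rough sketch of the host-side piece (illustrative only: it assumes
a kernel that exposes BTF for the kvm module, and the hook that would
actually adjust the vcpu task's priority does not exist today and is
precisely what such a series would have to provide, e.g. as a kfunc or
struct_ops), a BPF program attaching to a kvm tracepoint might look
like:

    /* SPDX-License-Identifier: GPL-2.0 */
    /*
     * Illustrative sketch only: attach to the kvm_exit tracepoint and look
     * up per-vcpu state shared by a (hypothetical) pvsched protocol.  How
     * the decision is pushed back into the scheduler is exactly what the
     * series would have to define; plain BPF cannot change task priorities.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, u32);    /* vcpu task pid */
            __type(value, s32);  /* priority requested by the guest */
    } guest_prio SEC(".maps");

    SEC("tp_btf/kvm_exit")
    int BPF_PROG(on_vmexit)
    {
            u32 pid = bpf_get_current_pid_tgid();  /* vcpu thread */
            s32 *prio = bpf_map_lookup_elem(&guest_prio, &pid);

            if (!prio)
                    return 0;
            /* Hypothetical hook: apply *prio to the vcpu task here. */
            return 0;
    }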

We had an offlist discussion about the above design and a couple of
ideas were suggested as alternatives. We had taken an action item to
study the alternatives for feasibility. The rest of the mail lists the
use cases (not exhaustive) and our feasibility investigations.

Existing use cases
-------------------------

- A latency sensitive workload on the guest might need more than one
time slice to complete, but should not block any higher priority task
in the host. In our design, the latency sensitive workload shares its
priority requirements with the host (RT priority, cfs nice value etc).
The host implementation of the protocol sets the priority of the vcpu
task accordingly so that the host scheduler can make an educated
decision on the next task to run. This makes sure that host processes
and vcpu tasks compete fairly for the cpu resource. (A sketch of the
kind of per-vcpu record this implies follows after this list.)
- Guest should be able to notify the host that it is running a lower
priority task so that the host can reschedule it if needed. As
mentioned before, the guest shares the priority with the host and the
host can make a better scheduling decision.
- Proactive vcpu boosting for events like interrupt injection.
Depending on the guest for a boost request might be too late, as the
vcpu might not be scheduled to run even after the interrupt is
injected. The host implementation of the protocol boosts the vcpu
task's priority so that it gets a better chance of being scheduled
immediately and the guest can handle the interrupt with minimal
latency. Once the guest is done handling the interrupt, it can notify
the host and lower the priority of the vcpu task.
- Guests which assign specialized tasks to specific vcpus can share
that information with the host so that the host can try to avoid
colocating those vcpus on a single physical cpu. For example, there
are interrupt pinning use cases where specific cpus are chosen to
handle critical interrupts, and passing this information to the host
could be useful.
- Another use case is the sharing of cpu capacity details between
guest and host. Sharing the host cpu's load with the guest will enable
the guest to schedule latency sensitive tasks on the best possible
vcpu. This could be partially achieved with steal time, but steal time
is more apparent on busy vcpus. There are workloads which are mostly
sleepers but wake up intermittently to serve short latency sensitive
work; input event handlers in Chrome are one such example.
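To make the priority sharing, irq boosting and capacity sharing use
cases above concrete, a per-vcpu record in the shared memory region
might carry fields along these lines (purely illustrative; the real
layout would be whatever the negotiated pvsched protocol defines):

    #include <linux/types.h>

    /* Illustrative per-vcpu shared record; not part of this series. */
    struct pvsched_vcpu_state {
            /* Guest -> host */
            __s32 req_policy;     /* scheduling class requested (cfs/RT) */
            __s32 req_prio;       /* RT priority or nice value of the guest task */
            __u32 boost;          /* set while handling an irq/critical section */

            /* Host -> guest */
            __u32 host_cpu_load;  /* load/capacity hint for guest task placement */
            __s32 granted_prio;   /* what the host actually applied */
    };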

Data from the prototype implementation shows promising improvement in
reducing latencies. Data was shared in the v1 cover letter. We have
not implemented the capacity based placement policies yet, but plan to
do that soon and have some real numbers to share.

Ideas brought up during offlist discussion
-------------------------------------------------------

1. rseq based timeslice extension mechanism[1]

While the rseq based mechanism helps in giving the vcpu task one more
time slice, it will not help in the other use cases. We had a chat
with Steve and the rseq mechanism was mainly for improving lock
contention and would not work best with vcpu boosting considering all
the use cases above. RT or high priority tasks in the VM would often
need more than one time slice to complete its work and at the same,
should not be hurting the host workloads. The goal for the above use
cases is not requesting an extra slice, but to modify the priority in
such a way that host processes and guest processes get a fair way to
compete for cpu resources. This also means that vcpu task can request
a lower priority when it is running lower priority tasks in the VM.

2. vDSO approach

Regarding the vDSO approach, we had a look at it and feel that,
without a major redesign of vDSO, it might be difficult to achieve the
requirements. vDSO is currently implemented as a read-only memory
region shared with processes. For this to work with virtualization, we
would need to map a similar region into the guest, and it would have
to be read-write. This is more or less what we are also proposing, but
with minimal changes in the core kernel. With the current design, the
shared memory region would be the responsibility of the virtual
pvsched device framework.

Sorry for the long mail. Please have a look and let us know your thoughts :-)

Thanks,

[1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/
Joel Fernandes July 12, 2024, 12:57 p.m. UTC | #5
On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> Ideas brought up during offlist discussion
> -------------------------------------------------------
> 
> 1. rseq based timeslice extension mechanism[1]
> 
> While the rseq based mechanism helps in giving the vcpu task one more
> time slice, it will not help in the other use cases. We had a chat
> with Steve and the rseq mechanism was mainly for improving lock
> contention and would not work best with vcpu boosting considering all
> the use cases above. RT or high priority tasks in the VM would often
> need more than one time slice to complete its work and at the same,
> should not be hurting the host workloads. The goal for the above use
> cases is not requesting an extra slice, but to modify the priority in
> such a way that host processes and guest processes get a fair way to
> compete for cpu resources. This also means that vcpu task can request
> a lower priority when it is running lower priority tasks in the VM.

I was looking at rseq on request from the KVM call; however, it does not
make sense to me yet how to expose the rseq area via the guest VA to the
host kernel.  rseq is for userspace to kernel, not VM to kernel.

Steven Rostedt said as much as well. Thoughts? Adding Mathieu as well.

This idea seems to suffer from the same over-engineering as the vDSO
approach below; rseq does not seem to fit.

Steven Rostedt told me that what we instead need is a tracepoint callback
in a driver that does the boosting.

 - Joel



> 
> 2. vDSO approach
> Regarding the vDSO approach, we had a look at that and feel that
> without a major redesign of vDSO, it might be difficult to achieve the
> requirements. vDSO is currently implemented as a shared read-only
> memory region with the processes. For this to work with
> virtualization, we would need to map a similar region to the guest and
> it has to be read-write. This is more or less what we are also
> proposing, but with minimal changes in the core kernel. With the
> current design, the shared memory region would be the responsibility
> of the virtual pvsched device framework.
> 
> Sorry for the long mail. Please have a look and let us know your thoughts :-)
> 
> Thanks,
> 
> [1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/
Mathieu Desnoyers July 12, 2024, 2:09 p.m. UTC | #6
On 2024-07-12 08:57, Joel Fernandes wrote:
> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
[...]
>> Existing use cases
>> -------------------------
>>
>> - A latency sensitive workload on the guest might need more than one
>> time slice to complete, but should not block any higher priority task
>> in the host. In our design, the latency sensitive workload shares its
>> priority requirements to host(RT priority, cfs nice value etc). Host
>> implementation of the protocol sets the priority of the vcpu task
>> accordingly so that the host scheduler can make an educated decision
>> on the next task to run. This makes sure that host processes and vcpu
>> tasks compete fairly for the cpu resource.

AFAIU, the information you need to convey to achieve this is the priority
of the task within the guest. This information needs to reach the host
scheduler to make an informed decision.

One thing that is unclear about this is what is the acceptable
overhead/latency to push this information from guest to host ?
Is a hypercall OK or does it need to be exchanged over a memory
mapping shared between guest and host ?

Hypercalls provide simple ABIs across guest/host, and they allow
the guest to immediately notify the host (similar to an interrupt).

A shared memory mapping will require a carefully crafted ABI layout,
and will only allow the host to use the information provided when
the host runs. Therefore, if the choice is to share this information
only through shared memory, the host scheduler will only be able to
read it when it runs, i.e. on a hypercall, interrupt, and so on.

>> - Guest should be able to notify the host that it is running a lower
>> priority task so that the host can reschedule it if needed. As
>> mentioned before, the guest shares the priority with the host and the
>> host takes a better scheduling decision.

It is unclear to me whether this information needs to be "pushed"
from guest to host (e.g. hypercall) in a way that allows the host
to immediately act on this information, or if it is OK to have the
host read this information when its scheduler happens to run.

>> - Proactive vcpu boosting for events like interrupt injection.
>> Depending on the guest for boost request might be too late as the vcpu
>> might not be scheduled to run even after interrupt injection. Host
>> implementation of the protocol boosts the vcpu tasks priority so that
>> it gets a better chance of immediately being scheduled and guest can
>> handle the interrupt with minimal latency. Once the guest is done
>> handling the interrupt, it can notify the host and lower the priority
>> of the vcpu task.

This appears to be a scenario where the host sets a "high priority", and
the guest clears it when it is done with the irq handler. I guess it can
be done either way (hypercall or shared memory), but the choice would
depend on the parameters identified above: acceptable overhead vs
acceptable latency to inform the host scheduler.

>> - Guests which assign specialized tasks to specific vcpus can share
>> that information with the host so that host can try to avoid
>> colocation of those cpus in a single physical cpu. for eg: there are
>> interrupt pinning use cases where specific cpus are chosen to handle
>> critical interrupts and passing this information to the host could be
>> useful.

How frequently is this topology expected to change ? Is it something that
is set once when the guest starts and then is fixed ? How often it changes
will likely affect the tradeoffs here.

>> - Another use case is the sharing of cpu capacity details between
>> guest and host. Sharing the host cpu's load with the guest will enable
>> the guest to schedule latency sensitive tasks on the best possible
>> vcpu. This could be partially achievable by steal time, but steal time
>> is more apparent on busy vcpus. There are workloads which are mostly
>> sleepers, but wake up intermittently to serve short latency sensitive
>> workloads. input event handlers in chrome is one such example.

OK, so for this use case the information goes the other way around: from
host to guest. Here the shared mapping seems better than polling the
state through a hypercall.

>>
>> Data from the prototype implementation shows promising improvement in
>> reducing latencies. Data was shared in the v1 cover letter. We have
>> not implemented the capacity based placement policies yet, but plan to
>> do that soon and have some real numbers to share.
>>
>> Ideas brought up during offlist discussion
>> -------------------------------------------------------
>>
>> 1. rseq based timeslice extension mechanism[1]
>>
>> While the rseq based mechanism helps in giving the vcpu task one more
>> time slice, it will not help in the other use cases. We had a chat
>> with Steve and the rseq mechanism was mainly for improving lock
>> contention and would not work best with vcpu boosting considering all
>> the use cases above. RT or high priority tasks in the VM would often
>> need more than one time slice to complete its work and at the same,
>> should not be hurting the host workloads. The goal for the above use
>> cases is not requesting an extra slice, but to modify the priority in
>> such a way that host processes and guest processes get a fair way to
>> compete for cpu resources. This also means that vcpu task can request
>> a lower priority when it is running lower priority tasks in the VM.
> 
> I was looking at the rseq on request from the KVM call, however it does not
> make sense to me yet how to expose the rseq area via the Guest VA to the host
> kernel.  rseq is for userspace to kernel, not VM to kernel.
> 
> Steven Rostedt said as much as well, thoughts? Add Mathieu as well.

I'm not sure that rseq would help at all here, but I think we may want to
borrow concepts of data sitting in shared memory across privilege levels
and apply them to VMs.

If some of the ideas end up being useful *outside* of the context of VMs,
then I'd be willing to consider adding fields to rseq. But as long as it is
VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
you can safely share across host/guest kernels.

> 
> This idea seems to suffer from the same vDSO over-engineering below, rseq
> does not seem to fit.
> 
> Steven Rostedt told me, what we instead need is a tracepoint callback in a
> driver, that does the boosting.

I utterly dislike changing the system behavior through tracepoints. They were
designed to observe the system, not modify its behavior. If people start abusing
them, then subsystem maintainers will stop adding them. Please don't do that.
Add a notifier or think about integrating what you are planning to add into the
driver instead.

Thanks,

Mathieu
Sean Christopherson July 12, 2024, 2:48 p.m. UTC | #7
On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> > > Existing use cases
> > > -------------------------
> > > 
> > > - A latency sensitive workload on the guest might need more than one
> > > time slice to complete, but should not block any higher priority task
> > > in the host. In our design, the latency sensitive workload shares its
> > > priority requirements to host(RT priority, cfs nice value etc). Host
> > > implementation of the protocol sets the priority of the vcpu task
> > > accordingly so that the host scheduler can make an educated decision
> > > on the next task to run. This makes sure that host processes and vcpu
> > > tasks compete fairly for the cpu resource.
> 
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information need to reach the host
> scheduler to make informed decision.
> 
> One thing that is unclear about this is what is the acceptable
> overhead/latency to push this information from guest to host ?
> Is an hypercall OK or does it need to be exchanged over a memory
> mapping shared between guest and host ?
> 
> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).

Hypercalls have myriad problems.  They require a VM-Exit, which largely defeats
the purpose of boosting the vCPU priority for performance reasons.  They don't
allow for delegation as there's no way for the hypervisor to know if a hypercall
from guest userspace should be allowed, versus anything memory based where the
ability for guest userspace to access the memory demonstrates permission (else
the guest kernel wouldn't have mapped the memory into userspace).

> > > Ideas brought up during offlist discussion
> > > -------------------------------------------------------
> > > 
> > > 1. rseq based timeslice extension mechanism[1]
> > > 
> > > While the rseq based mechanism helps in giving the vcpu task one more
> > > time slice, it will not help in the other use cases. We had a chat
> > > with Steve and the rseq mechanism was mainly for improving lock
> > > contention and would not work best with vcpu boosting considering all
> > > the use cases above. RT or high priority tasks in the VM would often
> > > need more than one time slice to complete its work and at the same,
> > > should not be hurting the host workloads. The goal for the above use
> > > cases is not requesting an extra slice, but to modify the priority in
> > > such a way that host processes and guest processes get a fair way to
> > > compete for cpu resources. This also means that vcpu task can request
> > > a lower priority when it is running lower priority tasks in the VM.

Then figure out a way to let userspace boost a task's priority without needing a
syscall.  vCPUs are not directly schedulable entities; the task doing KVM_RUN
on the vCPU fd is what the scheduler sees.  Any scheduling enhancement that
benefits vCPUs by definition can benefit userspace tasks.

> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel.  rseq is for userspace to kernel, not VM to kernel.

Any memory that is exposed to host userspace can be exposed to the guest.  Things
like this are implemented via "overlay" pages, where the guest asks host userspace
to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
address of the page containing the rseq structure associated with the vCPU (in
pretty much every modern VMM, each vCPU has a dedicated task/thread).

At that point, the vCPU can read/write the rseq structure directly.
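For reference, the overlay itself is plain memslot plumbing. A minimal
userspace sketch (error handling omitted; vm_fd, the slot number, the
agreed GPA and the page-aligned host address are assumed to come from
elsewhere in the VMM):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Map the host page containing a per-vCPU structure (e.g. its rseq
     * area) into the guest at an agreed GPA. */
    static int overlay_page(int vm_fd, unsigned int slot,
                            unsigned long long gpa, void *host_page)
    {
            struct kvm_userspace_memory_region region;

            memset(&region, 0, sizeof(region));
            region.slot = slot;
            region.guest_phys_addr = gpa;
            region.memory_size = 4096;                 /* one page */
            region.userspace_addr = (unsigned long)host_page;

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }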

The reason us KVM folks are pushing y'all towards something like rseq is that
(again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
is actually just priority boosting a task.  So rather than invent something
virtualization specific, invent a mechanism for priority boosting from userspace
without a syscall, and then extend it to the virtualization use case.

> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
> 
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
> 
> If some of the ideas end up being useful *outside* of the context of VMs,

Modulo the assertion above that this is about boosting priority instead of
requesting an extended time slice, this is essentially the same thing as the
"delay resched" discussion[*].  The only difference is that the vCPU is in a
critical section, e.g. an IRQ handler, versus the userspace task being in a
critical section.

[*] https://lore.kernel.org/all/20231025054219.1acaa3dd@gandalf.local.home

> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
> you can safely share across host/guest kernels.
Mathieu Desnoyers July 12, 2024, 3:32 p.m. UTC | #8
On 2024-07-12 10:48, Sean Christopherson wrote:
> On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
>> On 2024-07-12 08:57, Joel Fernandes wrote:
>>> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
>> [...]
>>>> Existing use cases
>>>> -------------------------
>>>>
>>>> - A latency sensitive workload on the guest might need more than one
>>>> time slice to complete, but should not block any higher priority task
>>>> in the host. In our design, the latency sensitive workload shares its
>>>> priority requirements to host(RT priority, cfs nice value etc). Host
>>>> implementation of the protocol sets the priority of the vcpu task
>>>> accordingly so that the host scheduler can make an educated decision
>>>> on the next task to run. This makes sure that host processes and vcpu
>>>> tasks compete fairly for the cpu resource.
>>
>> AFAIU, the information you need to convey to achieve this is the priority
>> of the task within the guest. This information need to reach the host
>> scheduler to make informed decision.
>>
>> One thing that is unclear about this is what is the acceptable
>> overhead/latency to push this information from guest to host ?
>> Is an hypercall OK or does it need to be exchanged over a memory
>> mapping shared between guest and host ?
>>
>> Hypercalls provide simple ABIs across guest/host, and they allow
>> the guest to immediately notify the host (similar to an interrupt).
> 
> Hypercalls have myriad problems.  They require a VM-Exit, which largely defeats
> the purpose of boosting the vCPU priority for performance reasons.  They don't
> allow for delegation as there's no way for the hypervisor to know if a hypercall
> from guest userspace should be allowed, versus anything memory based where the
> ability for guest userspace to access the memory demonstrates permission (else
> the guest kernel wouldn't have mapped the memory into userspace).

OK, this answers my question above: the overhead of the hypercall pretty
much defeats the purpose of this priority boosting.

> 
>>>> Ideas brought up during offlist discussion
>>>> -------------------------------------------------------
>>>>
>>>> 1. rseq based timeslice extension mechanism[1]
>>>>
>>>> While the rseq based mechanism helps in giving the vcpu task one more
>>>> time slice, it will not help in the other use cases. We had a chat
>>>> with Steve and the rseq mechanism was mainly for improving lock
>>>> contention and would not work best with vcpu boosting considering all
>>>> the use cases above. RT or high priority tasks in the VM would often
>>>> need more than one time slice to complete its work and at the same,
>>>> should not be hurting the host workloads. The goal for the above use
>>>> cases is not requesting an extra slice, but to modify the priority in
>>>> such a way that host processes and guest processes get a fair way to
>>>> compete for cpu resources. This also means that vcpu task can request
>>>> a lower priority when it is running lower priority tasks in the VM.
> 
> Then figure out a way to let userspace boot a task's priority without needing a
> syscall.  vCPUs are not directly schedulable entities, the task doing KVM_RUN
> on the vCPU fd is what the scheduler sees.  Any scheduling enhancement that
> benefits vCPUs by definition can benefit userspace tasks.

Yes.

> 
>>> I was looking at the rseq on request from the KVM call, however it does not
>>> make sense to me yet how to expose the rseq area via the Guest VA to the host
>>> kernel.  rseq is for userspace to kernel, not VM to kernel.
> 
> Any memory that is exposed to host userspace can be exposed to the guest.  Things
> like this are implemented via "overlay" pages, where the guest asks host userspace
> to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> address of the page containing the rseq structure associated with the vCPU (in
> pretty much every modern VMM, each vCPU has a dedicated task/thread).
> 
> A that point, the vCPU can read/write the rseq structure directly.

This helps me understand what you are trying to achieve. I disagree with
some aspects of the design you present above: mainly the lack of
isolation between the guest kernel and the host task doing the KVM_RUN.
We do not want to let the guest kernel store to rseq fields that would
result in getting the host task killed (e.g. a bogus rseq_cs pointer).
But this is something we can improve upon once we understand what we
are trying to achieve.

> 
> The reason us KVM folks are pushing y'all towards something like rseq is that
> (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> is actually just priority boosting a task.  So rather than invent something
> virtualization specific, invent a mechanism for priority boosting from userspace
> without a syscall, and then extend it to the virtualization use case.
> 
[...]

OK, so how about we expose "offsets" tuning the base values ?

- The task doing KVM_RUN, just like any other task, has its "priority"
   value as set by setpriority(2).

- We introduce two new fields in the per-thread struct rseq, which is
   mapped in the host task doing KVM_RUN and readable from the scheduler:

   - __s32 prio_offset; /* Priority offset to apply on the current task priority. */

   - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */

     vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
     which would be typically NULL except for tasks doing KVM_RUN. This would
     sit in its own pages per vcpu, which takes care of isolation between guest
     kernel and host process. Those would be RW by the guest kernel as
     well and contain e.g.:

     struct vcpu_sched {
         __u32 len;  /* Length of active fields. */

         __s32 prio_offset;
         __s32 cpu_capacity_offset;
         [...]
     };

So when the host kernel tries to calculate the effective priority of a task
doing KVM_RUN, it would basically start from its current priority, and offset
it by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).

The cpu_capacity_offset would be populated by the host kernel and read by the
guest kernel scheduler for scheduling/migration decisions.

I'm certainly missing details about how priority offsets should be bounded for
given tasks. This could be an extension to setrlimit(2).
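A rough sketch of how the scheduler side might combine these
(self-contained, with placeholder bounds; how the bound would really be
configured, e.g. via a setrlimit(2) extension, is exactly the open
question above):

    /* Illustrative only; names follow the structs sketched above, not
     * any existing kernel code. */
    #define PRIO_OFFSET_LIMIT 10  /* placeholder per-task bound */

    struct vcpu_sched { int prio_offset; int cpu_capacity_offset; };
    struct rseq_ext   { int prio_offset; struct vcpu_sched *vcpu_sched; };

    static int effective_nice(int base_nice, const struct rseq_ext *rs)
    {
            int offset = rs->prio_offset;

            if (rs->vcpu_sched)
                    offset += rs->vcpu_sched->prio_offset;

            /* Bound how far a task may boost or deboost itself. */
            if (offset < -PRIO_OFFSET_LIMIT)
                    offset = -PRIO_OFFSET_LIMIT;
            if (offset > PRIO_OFFSET_LIMIT)
                    offset = PRIO_OFFSET_LIMIT;

            /* Nice values stay within [-20, 19]. */
            base_nice += offset;
            if (base_nice < -20)
                    base_nice = -20;
            if (base_nice > 19)
                    base_nice = 19;
            return base_nice;
    }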

Thoughts ?

Thanks,

Mathieu
Sean Christopherson July 12, 2024, 4:14 p.m. UTC | #9
On Fri, Jul 12, 2024, Mathieu Desnoyers wrote:
> On 2024-07-12 10:48, Sean Christopherson wrote:
> > > > I was looking at the rseq on request from the KVM call, however it does not
> > > > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > > > kernel.  rseq is for userspace to kernel, not VM to kernel.
> > 
> > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > 
> > At that point, the vCPU can read/write the rseq structure directly.
> 
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).

Yeah, exposing the full rseq structure to the guest is probably a terrible idea.
The above isn't intended to be a design, the goal is just to illustrate how an
rseq-like mechanism can be extended to the guest without needing virtualization
specific ABI and without needing new KVM functionality.

> But this is something we can improve upon once we understand what we
> are trying to achieve.
> 
> > 
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task.  So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> > 
> [...]
> 
> OK, so how about we expose "offsets" tuning the base values ?
> 
> - The task doing KVM_RUN, just like any other task, has its "priority"
>   value as set by setpriority(2).
> 
> - We introduce two new fields in the per-thread struct rseq, which is
>   mapped in the host task doing KVM_RUN and readable from the scheduler:
> 
>   - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
> 
>   - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */

Ideally, there won't be a virtualization specific structure.  A vCPU specific
field might make sense (or it might not), but I really want to avoid defining a
structure that is unique to virtualization.  E.g. a userspace doing M:N scheduling
can likely benefit from any capacity hooks/information that would benefit a guest
scheduler.  I.e. rather than a vcpu_sched structure, have a user_sched structure
(or whatever name makes sense), and then have two struct pointers in rseq.

Though I'm skeptical that having two structs in play would be necessary or sane.
E.g. if both userspace and guest can adjust priority, then they'll need to coordinate
in order to avoid unexpected results.  I can definitely see wanting to let the
userspace VMM bound the priority of a vCPU, but that should be a relatively static
decision, i.e. can be done via syscall or something similarly "slow".
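Purely as an illustration of that shape (every name here is invented, this is
not a proposed ABI), the "two struct pointers in rseq" could look like:

#include <stdint.h>

/* One generic format, usable by an M:N userspace scheduler or a guest. */
struct user_sched {
        uint32_t len;                   /* Length of active fields. */
        int32_t  prio_offset;           /* Requested priority adjustment. */
        int32_t  cpu_capacity_offset;   /* Capacity hint filled in by the kernel. */
};

/* Hypothetical additions to the per-thread rseq area. */
struct rseq_sched_ptrs {
        struct user_sched *user_sched;  /* Owned/written by host userspace. */
        struct user_sched *guest_sched; /* Same format; page shared with the guest. */
};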

>     vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
>     which would be typically NULL except for tasks doing KVM_RUN. This would
>     sit in its own pages per vcpu, which takes care of isolation between guest
>     kernel and host process. Those would be RW by the guest kernel as
>     well and contain e.g.:
> 
>     struct vcpu_sched {
>         __u32 len;  /* Length of active fields. */
> 
>         __s32 prio_offset;
>         __s32 cpu_capacity_offset;
>         [...]
>     };
> 
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
> 
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
> 
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Steven Rostedt July 12, 2024, 4:24 p.m. UTC | #10
On Fri, 12 Jul 2024 10:09:03 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > driver, that does the boosting.  
> 
> I utterly dislike changing the system behavior through tracepoints. They were
> designed to observe the system, not modify its behavior. If people start abusing
> them, then subsystem maintainers will stop adding them. Please don't do that.
> Add a notifier or think about integrating what you are planning to add into the
> driver instead.

I tend to agree that a notifier would be much better than using
tracepoints, but then I also think eBPF has already let that cat out of
the bag. :-p

All we need is a notifier that gets called at every VMEXIT.

The main issue this is trying to solve is boosting the priority of
the guest without making a hypercall, so that it can react quickly
(lower the latency of reaction to an event). Now when the task is
unboosted, there's no avoiding the hypercall, as there's no other way
to tell the host that this vCPU should not be running at a higher
priority (the high priority may prevent reschedules, or even prevent
checking the new prio in the shared memory).

If there's a way to have shared memory, via virtio or whatever, and a
notifier that gets called at every VMEXIT, then this is trivial to
implement.

-- Steve
Joel Fernandes July 12, 2024, 4:24 p.m. UTC | #11
On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-07-12 08:57, Joel Fernandes wrote:
> > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote:
> [...]
> >> Existing use cases
> >> -------------------------
> >>
> >> - A latency sensitive workload on the guest might need more than one
> >> time slice to complete, but should not block any higher priority task
> >> in the host. In our design, the latency sensitive workload shares its
> >> priority requirements to host(RT priority, cfs nice value etc). Host
> >> implementation of the protocol sets the priority of the vcpu task
> >> accordingly so that the host scheduler can make an educated decision
> >> on the next task to run. This makes sure that host processes and vcpu
> >> tasks compete fairly for the cpu resource.
>
> AFAIU, the information you need to convey to achieve this is the priority
> of the task within the guest. This information needs to reach the host
> scheduler to make an informed decision.
>
> One thing that is unclear about this is what is the acceptable
> overhead/latency to push this information from guest to host ?
> Is an hypercall OK or does it need to be exchanged over a memory
> mapping shared between guest and host ?

Shared memory for the boost (the host can pick it up later, during host
preemption). But for unboost, we possibly need a hypercall in addition
to the shared memory.

>
> Hypercalls provide simple ABIs across guest/host, and they allow
> the guest to immediately notify the host (similar to an interrupt).
>
> Shared memory mapping will require a carefully crafted ABI layout,
> and will only allow the host to use the information provided when
> the host runs. Therefore, if the choice is to share this information
> only through shared memory, the host scheduler will only be able to
> read it when it runs, i.e. on a hypercall, an interrupt, and so on.

The initial idea was to handle the details/format/allocation of the
shared memory out-of-band in a driver, but then later the rseq idea
came up.

> >> - Guest should be able to notify the host that it is running a lower
> >> priority task so that the host can reschedule it if needed. As
> >> mentioned before, the guest shares the priority with the host and the
> >> host takes a better scheduling decision.
>
> It is unclear to me whether this information needs to be "pushed"
> from guest to host (e.g. hypercall) in a way that allows the host
> to immediately act on this information, or if it is OK to have the
> host read this information when its scheduler happens to run.

For boosting, there is no need to immediately push. Only on preemption.

> >> - Proactive vcpu boosting for events like interrupt injection.
> >> Depending on the guest for boost request might be too late as the vcpu
> >> might not be scheduled to run even after interrupt injection. Host
> >> implementation of the protocol boosts the vcpu tasks priority so that
> >> it gets a better chance of immediately being scheduled and guest can
> >> handle the interrupt with minimal latency. Once the guest is done
> >> handling the interrupt, it can notify the host and lower the priority
> >> of the vcpu task.
>
> This appears to be a scenario where the host sets a "high priority", and
> the guest clears it when it is done with the irq handler. I guess it can
> be done either ways (hypercall or shared memory), but the choice would
> depend on the parameters identified above: acceptable overhead vs acceptable
> latency to inform the host scheduler.

Yes, we have found ways to reduce the number of hypercalls on unboost.

> >> - Guests which assign specialized tasks to specific vcpus can share
> >> that information with the host so that host can try to avoid
> >> colocation of those cpus in a single physical cpu. for eg: there are
> >> interrupt pinning use cases where specific cpus are chosen to handle
> >> critical interrupts and passing this information to the host could be
> >> useful.
>
> How frequently is this topology expected to change ? Is it something that
> is set once when the guest starts and then is fixed ? How often it changes
> will likely affect the tradeoffs here.

Yes, will be fixed.

> >> - Another use case is the sharing of cpu capacity details between
> >> guest and host. Sharing the host cpu's load with the guest will enable
> >> the guest to schedule latency sensitive tasks on the best possible
> >> vcpu. This could be partially achievable by steal time, but steal time
> >> is more apparent on busy vcpus. There are workloads which are mostly
> >> sleepers, but wake up intermittently to serve short latency sensitive
> >> workloads. Input event handlers in Chrome are one such example.
>
> OK so for this use-case information goes the other way around: from host
> to guest. Here the shared mapping seems better than polling the state
> through an hypercall.

Yes, FWIW this particular part is for the future and not initially required per se.

> >> Data from the prototype implementation shows promising improvement in
> >> reducing latencies. Data was shared in the v1 cover letter. We have
> >> not implemented the capacity based placement policies yet, but plan to
> >> do that soon and have some real numbers to share.
> >>
> >> Ideas brought up during offlist discussion
> >> -------------------------------------------------------
> >>
> >> 1. rseq based timeslice extension mechanism[1]
> >>
> >> While the rseq based mechanism helps in giving the vcpu task one more
> >> time slice, it will not help in the other use cases. We had a chat
> >> with Steve and the rseq mechanism was mainly for improving lock
> >> contention and would not work best with vcpu boosting considering all
> >> the use cases above. RT or high priority tasks in the VM would often
> >> need more than one time slice to complete their work and, at the same time,
> >> should not be hurting the host workloads. The goal for the above use
> >> cases is not requesting an extra slice, but to modify the priority in
> >> such a way that host processes and guest processes get a fair way to
> >> compete for cpu resources. This also means that vcpu task can request
> >> a lower priority when it is running lower priority tasks in the VM.
> >
> > I was looking at the rseq on request from the KVM call, however it does not
> > make sense to me yet how to expose the rseq area via the Guest VA to the host
> > kernel.  rseq is for userspace to kernel, not VM to kernel.
> >
> > Steven Rostedt said as much as well, thoughts? Add Mathieu as well.
>
> I'm not sure that rseq would help at all here, but I think we may want to
> borrow concepts of data sitting in shared memory across privilege levels
> and apply them to VMs.
>
> If some of the ideas end up being useful *outside* of the context of VMs,
> then I'd be willing to consider adding fields to rseq. But as long as it is
> VM-specific, I suspect you'd be better with dedicated per-vcpu pages which
> you can safely share across host/guest kernels.

Yes, this was the initial plan. I also feel rseq cannot be applied here.

> > This idea seems to suffer from the same vDSO over-engineering below, rseq
> > does not seem to fit.
> >
> > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > driver, that does the boosting.
>
> I utterly dislike changing the system behavior through tracepoints. They were
> designed to observe the system, not modify its behavior. If people start abusing
> them, then subsystem maintainers will stop adding them. Please don't do that.
> Add a notifier or think about integrating what you are planning to add into the
> driver instead.

Well, we do have "raw" tracepoints not accessible from userspace, so
you're saying even those are off limits for adding callbacks?

 - Joel
Steven Rostedt July 12, 2024, 4:30 p.m. UTC | #12
On Fri, 12 Jul 2024 11:32:30 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >>> I was looking at the rseq on request from the KVM call, however it does not
> >>> make sense to me yet how to expose the rseq area via the Guest VA to the host
> >>> kernel.  rseq is for userspace to kernel, not VM to kernel.  
> > 
> > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > like this are implemented via "overlay" pages, where the guest asks host userspace
> > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > address of the page containing the rseq structure associated with the vCPU (in
> > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > 
> > At that point, the vCPU can read/write the rseq structure directly.

So basically, the vCPU thread can just create a virtio device that
exposes the rseq memory to the guest kernel?

One other issue we need to worry about is that IIUC rseq memory is
allocated by the guest/user, not the host kernel. This means it can be
swapped out. The code that handles this needs to be able to handle user
page faults.

> 
> This helps me understand what you are trying to achieve. I disagree with
> some aspects of the design you present above: mainly the lack of
> isolation between the guest kernel and the host task doing the KVM_RUN.
> We do not want to let the guest kernel store to rseq fields that would
> result in getting the host task killed (e.g. a bogus rseq_cs pointer).
> But this is something we can improve upon once we understand what we
> are trying to achieve.
> 
> > 
> > The reason us KVM folks are pushing y'all towards something like rseq is that
> > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU
> > is actually just priority boosting a task.  So rather than invent something
> > virtualization specific, invent a mechanism for priority boosting from userspace
> > without a syscall, and then extend it to the virtualization use case.
> >   
> [...]
> 
> OK, so how about we expose "offsets" tuning the base values ?
> 
> - The task doing KVM_RUN, just like any other task, has its "priority"
>    value as set by setpriority(2).
> 
> - We introduce two new fields in the per-thread struct rseq, which is
>    mapped in the host task doing KVM_RUN and readable from the scheduler:
> 
>    - __s32 prio_offset; /* Priority offset to apply on the current task priority. */
> 
>    - __u64 vcpu_sched;  /* Pointer to a struct vcpu_sched in user-space */
> 
>      vcpu_sched would be a userspace pointer to a new vcpu_sched structure,
>      which would be typically NULL except for tasks doing KVM_RUN. This would
>      sit in its own pages per vcpu, which takes care of isolation between guest
>      kernel and host process. Those would be RW by the guest kernel as
>      well and contain e.g.:

Hmm, maybe not make this only vcpu specific, but perhaps this can be
useful for user space tasks that want to dynamically change their
priority without a system call. It could do the same thing. Yeah, yeah,
I may be coming up with a solution in search of a problem ;-)

-- Steve

> 
>      struct vcpu_sched {
>          __u32 len;  /* Length of active fields. */
> 
>          __s32 prio_offset;
>          __s32 cpu_capacity_offset;
>          [...]
>      };
> 
> So when the host kernel tries to calculate the effective priority of a task
> doing KVM_RUN, it would basically start from its current priority, and offset
> by (rseq->prio_offset + rseq->vcpu_sched->prio_offset).
> 
> The cpu_capacity_offset would be populated by the host kernel and read by the
> guest kernel scheduler for scheduling/migration decisions.
> 
> I'm certainly missing details about how priority offsets should be bounded for
> given tasks. This could be an extension to setrlimit(2).
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
>
Sean Christopherson July 12, 2024, 4:39 p.m. UTC | #13
On Fri, Jul 12, 2024, Steven Rostedt wrote:
> On Fri, 12 Jul 2024 11:32:30 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > >>> I was looking at the rseq on request from the KVM call, however it does not
> > >>> make sense to me yet how to expose the rseq area via the Guest VA to the host
> > >>> kernel.  rseq is for userspace to kernel, not VM to kernel.  
> > > 
> > > Any memory that is exposed to host userspace can be exposed to the guest.  Things
> > > like this are implemented via "overlay" pages, where the guest asks host userspace
> > > to map the magic page (rseq in this case) at GPA 'x'.  Userspace then creates a
> > > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the
> > > address of the page containing the rseq structure associated with the vCPU (in
> > > pretty much every modern VMM, each vCPU has a dedicated task/thread).
> > > 
> > > At that point, the vCPU can read/write the rseq structure directly.
> 
> So basically, the vCPU thread can just create a virtio device that
> exposes the rseq memory to the guest kernel?
> 
> One other issue we need to worry about is that IIUC rseq memory is
> allocated by the guest/user, not the host kernel. This means it can be
> swapped out. The code that handles this needs to be able to handle user
> page faults.

This is a non-issue, it will Just Work, same as any other memory that is exposed
to the guest and can be reclaimed/swapped/migrated.

If the host swaps out the rseq page, mmu_notifiers will call into KVM and KVM will
unmap the page from the guest.  If/when the page is accessed by the guest, KVM
will fault the page back into the host's primary MMU, and then map the new pfn
into the guest.
Sean Christopherson July 12, 2024, 4:44 p.m. UTC | #14
On Fri, Jul 12, 2024, Steven Rostedt wrote:
> On Fri, 12 Jul 2024 10:09:03 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > > 
> > > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > > driver, that does the boosting.  
> > 
> > I utterly dislike changing the system behavior through tracepoints. They were
> > designed to observe the system, not modify its behavior. If people start abusing
> > them, then subsystem maintainers will stop adding them. Please don't do that.
> > Add a notifier or think about integrating what you are planning to add into the
> > driver instead.
> 
> I tend to agree that a notifier would be much better than using
> tracepoints, but then I also think eBPF has already let that cat out of
> the bag. :-p
> 
> All we need is a notifier that gets called at every VMEXIT.

Why?  The only argument I've seen for needing to hook VM-Exit is so that the
host can speculatively boost the priority of the vCPU when delivering an IRQ,
but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
_before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
modern hardware that supports posted interrupts and IPI virtualization, i.e. for
which there will be no VM-Exit.
Joel Fernandes July 12, 2024, 4:50 p.m. UTC | #15
On Fri, Jul 12, 2024 at 09:44:16AM -0700, Sean Christopherson wrote:
> On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > On Fri, 12 Jul 2024 10:09:03 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > > > 
> > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > > > driver, that does the boosting.  
> > > 
> > > I utterly dislike changing the system behavior through tracepoints. They were
> > > designed to observe the system, not modify its behavior. If people start abusing
> > > them, then subsystem maintainers will stop adding them. Please don't do that.
> > > Add a notifier or think about integrating what you are planning to add into the
> > > driver instead.
> > 
> > I tend to agree that a notifier would be much better than using
> > tracepoints, but then I also think eBPF has already let that cat out of
> > the bag. :-p
> > 
> > All we need is a notifier that gets called at every VMEXIT.
> 
> Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> host can speculatively boost the priority of the vCPU when delivering an IRQ,
> but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> which there will be no VM-Exit.

I am a bit confused by your statement, Sean, because if a higher prio HOST
thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should
happen. That has nothing to do with IRQ delivery.  What am I missing?

thanks,

 - Joel
Steven Rostedt July 12, 2024, 5:02 p.m. UTC | #16
On Fri, 12 Jul 2024 09:39:52 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > 
> > One other issue we need to worry about is that IIUC rseq memory is
> > allocated by the guest/user, not the host kernel. This means it can be
> > swapped out. The code that handles this needs to be able to handle user
> > page faults.  
> 
> This is a non-issue, it will Just Work, same as any other memory that is exposed
> to the guest and can be reclaimed/swapped/migrated..
> 
> If the host swaps out the rseq page, mmu_notifiers will call into KVM and KVM will
> unmap the page from the guest.  If/when the page is accessed by the guest, KVM
> will fault the page back into the host's primary MMU, and then map the new pfn
> into the guest.

My comment is that, in the host kernel, the access to this memory needs
to be user-page-fault safe. It can't be touched from atomic context.

-- Steve
Sean Christopherson July 12, 2024, 5:08 p.m. UTC | #17
On Fri, Jul 12, 2024, Joel Fernandes wrote:
> On Fri, Jul 12, 2024 at 09:44:16AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > > On Fri, 12 Jul 2024 10:09:03 -0400
> > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > > 
> > > > > 
> > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a
> > > > > driver, that does the boosting.  
> > > > 
> > > > I utterly dislike changing the system behavior through tracepoints. They were
> > > > designed to observe the system, not modify its behavior. If people start abusing
> > > > them, then subsystem maintainers will stop adding them. Please don't do that.
> > > > Add a notifier or think about integrating what you are planning to add into the
> > > > driver instead.
> > > 
> > > I tend to agree that a notifier would be much better than using
> > > tracepoints, but then I also think eBPF has already let that cat out of
> > > the bag. :-p
> > > 
> > > All we need is a notifier that gets called at every VMEXIT.
> > 
> > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > which there will be no VM-Exit.
> 
> I am a bit confused by your statement Sean, because if a higher prio HOST
> thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should
> happen. That has nothing to do with IRQ delivery.  What am I missing?

Why does that require hooking VM-Exit?
Steven Rostedt July 12, 2024, 5:12 p.m. UTC | #18
On Fri, 12 Jul 2024 09:44:16 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > All we need is a notifier that gets called at every VMEXIT.  
> 
> Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> host can speculatively boost the priority of the vCPU when delivering an IRQ,
> but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> which there will be no VM-Exit.

No. The speculative boost was for something else, but slightly
related. I guess the idea there was to have the incoming interrupt
boost the vCPU because the interrupt could be waking an RT task. It may
still be something needed, but that's not what I'm talking about here.

The idea here is when an RT task is scheduled in on the guest, we want
to lazily boost it. As long as the vCPU is running on the CPU, we do
not need to do anything. If the RT task is scheduled for a very short
time, it should not need to call any hypercall. It would set the shared
memory to the new priority when the RT task is scheduled, and then put
back the lower priority when it is scheduled out and a SCHED_OTHER task
is scheduled in.

Now if the vCPU gets preempted, it is this moment that we need the host
kernel to look at the current priority of the task thread running on
the vCPU. If it is an RT task, we need to boost the vCPU to that
priority, so that a lower priority host thread does not interrupt it.

The host should also set a bit in the shared memory to tell the guest
that it was boosted. Then when the vCPU schedules a lower priority task
than what is in shared memory, and the bit is set that tells the guest
the host boosted the vCPU, it needs to make a hypercall to tell the
host that it can lower its priority again.

The incoming irq is to handle the race between the event that wakes the
RT task, and the RT task getting a chance to run. If the preemption
happens there, the vCPU may never have a chance to notify the host that
it wants to run an RT task.
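
To make that concrete, here's a stand-alone toy model of the protocol (this
is not KVM or guest kernel code; the shared page layout, the flag name and
the hypercall stub are made up, and higher value means higher priority here
just to keep the model simple):

#include <stdatomic.h>
#include <stdint.h>

struct pv_sched_shared {
        _Atomic int32_t  guest_prio;    /* Written by the guest scheduler. */
        _Atomic uint32_t flags;         /* PV_HOST_BOOSTED set by the host. */
};
#define PV_HOST_BOOSTED 0x1u

/* Hypothetical stand-in for the real "lower my priority" hypercall. */
static void hypercall_set_prio(int32_t prio) { (void)prio; }

/* Guest: RT task scheduled in -- just a store, no hypercall. */
static void guest_sched_in_rt(struct pv_sched_shared *s, int32_t rt_prio)
{
        atomic_store(&s->guest_prio, rt_prio);
}

/* Guest: back to SCHED_OTHER.  Hypercall only if the host actually boosted us. */
static void guest_sched_in_other(struct pv_sched_shared *s, int32_t normal_prio)
{
        atomic_store(&s->guest_prio, normal_prio);
        if (atomic_fetch_and(&s->flags, ~PV_HOST_BOOSTED) & PV_HOST_BOOSTED)
                hypercall_set_prio(normal_prio);
}

/* Host: on preemption (seen at VM-Exit), honour the requested priority. */
static int32_t host_on_preempt(struct pv_sched_shared *s, int32_t current_prio)
{
        int32_t want = atomic_load(&s->guest_prio);

        if (want > current_prio) {
                atomic_fetch_or(&s->flags, PV_HOST_BOOSTED);
                return want;    /* e.g. applied via an rt_mutex_setprio()-like boost */
        }
        return current_prio;
}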

-- Steve
Steven Rostedt July 12, 2024, 5:14 p.m. UTC | #19
On Fri, 12 Jul 2024 10:08:43 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > I am a bit confused by your statement Sean, because if a higher prio HOST
> > thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should
> > happen. That has nothing to do with IRQ delivery.  What am I missing?  
> 
> Why does that require hooking VM-Exit?

To do the lazy boosting. That's the time that the host can see the
priority of the currently running task.

-- Steve
Mathieu Desnoyers July 12, 2024, 5:28 p.m. UTC | #20
On 2024-07-12 12:24, Joel Fernandes wrote:
> On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
[...]
>>>
>>> Steven Rostedt told me, what we instead need is a tracepoint callback in a
>>> driver, that does the boosting.
>>
>> I utterly dislike changing the system behavior through tracepoints. They were
>> designed to observe the system, not modify its behavior. If people start abusing
>> them, then subsystem maintainers will stop adding them. Please don't do that.
>> Add a notifier or think about integrating what you are planning to add into the
>> driver instead.
> 
> Well, we do have "raw" tracepoints not accessible from userspace, so
> you're saying even those are off limits for adding callbacks?

Yes. Even the "raw" tracepoints were designed as an "observation only"
API. Using them in lieu of notifiers is really repurposing them for
something they were not meant to do.

Just in terms of maintainability at the caller site, we should be
allowed to consider _all_ tracepoints as mostly exempt from side-effects
outside of the data structures within the attached tracers. This is not
true anymore if they are repurposed as notifiers.

Thanks,

Mathieu
Sean Christopherson July 16, 2024, 11:44 p.m. UTC | #21
On Fri, Jul 12, 2024, Steven Rostedt wrote:
> On Fri, 12 Jul 2024 09:44:16 -0700
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > > All we need is a notifier that gets called at every VMEXIT.  
> > 
> > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > which there will be no VM-Exit.
> 
> No. The speculatively boost was for something else, but slightly
> related. I guess the ideal there was to have the interrupt coming in
> boost the vCPU because the interrupt could be waking an RT task. It may
> still be something needed, but that's not what I'm talking about here.
> 
> The idea here is when an RT task is scheduled in on the guest, we want
> to lazily boost it. As long as the vCPU is running on the CPU, we do
> not need to do anything. If the RT task is scheduled for a very short
> time, it should not need to call any hypercall. It would set the shared
> memory to the new priority when the RT task is scheduled, and then put
> back the lower priority when it is scheduled out and a SCHED_OTHER task
> is scheduled in.
> 
> Now if the vCPU gets preempted, it is this moment that we need the host
> kernel to look at the current priority of the task thread running on
> the vCPU. If it is an RT task, we need to boost the vCPU to that
> priority, so that a lower priority host thread does not interrupt it.

I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
preempted, the host scheduler is already getting "notified", otherwise the vCPU
would still be scheduled in, i.e. wouldn't have been preempted.

> The host should also set a bit in the shared memory to tell the guest
> that it was boosted. Then when the vCPU schedules a lower priority task
> than what is in shared memory, and the bit is set that tells the guest
> the host boosted the vCPU, it needs to make a hypercall to tell the
> host that it can lower its priority again.

Which again doesn't _need_ a dedicated/manual VM-Exit.  E.g. why force the host
to reassess the priority instead of simply waiting until the next reschedule?  If
the host is running tickless, then presumably there is a scheduling entity running
on a different pCPU, i.e. that can react to vCPU priority changes without needing
a VM-Exit.
Steven Rostedt July 17, 2024, 12:13 a.m. UTC | #22
On Tue, 16 Jul 2024 16:44:05 -0700
Sean Christopherson <seanjc@google.com> wrote:
> > 
> > Now if the vCPU gets preempted, it is this moment that we need the host
> > kernel to look at the current priority of the task thread running on
> > the vCPU. If it is an RT task, we need to boost the vCPU to that
> > priority, so that a lower priority host thread does not interrupt it.  
> 
> I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
> preempted, the host scheduler is already getting "notified", otherwise the vCPU
> would still be scheduled in, i.e. wouldn't have been preempted.

The guest wants to lazily up its priority when needed. So, it changes its
priority on this shared memory, but the host doesn't know about the raised
priority, and decides to preempt it (where it would not if it knew the
priority was raised). Then it exits into the host via VMEXIT. When else is
the host going to learn of this priority change?

> 
> > The host should also set a bit in the shared memory to tell the guest
> > that it was boosted. Then when the vCPU schedules a lower priority task
> > than what is in shared memory, and the bit is set that tells the guest
> > the host boosted the vCPU, it needs to make a hypercall to tell the
> > host that it can lower its priority again.  
> 
> Which again doesn't _need_ a dedicated/manual VM-Exit.  E.g. why force the host
> to reasses the priority instead of simply waiting until the next reschedule?  If
> the host is running tickless, then presumably there is a scheduling entity running
> on a different pCPU, i.e. that can react to vCPU priority changes without needing
> a VM-Exit.

This is done in a shared memory location. The guest can raise and lower its
priority via writing into the shared memory. It may raise and lower it back
without the host ever knowing. No hypercall needed.

But if it raises its priority, and the host decides to schedule something
else because the host is unaware of the raised priority, it will preempt
the vCPU. Then when it exits into the host (via VMEXIT), this is the first
time the host will know that its priority was raised, and then we can call
something like rt_mutex_setprio() to lazily change its priority.
bit to inform the guest that the host knows of the change, and when the
guest lowers its priority, it will now need to make a hypercall to tell the
kernel its priority is low again, and it's OK to preempt it normally.

This is similar to how some architectures do lazy irq disabling, where
they only set some memory that says interrupts are disabled. Interrupts
only actually get disabled if an interrupt goes off and the code sees the
"soft disabled" flag, at which point it disables them for real. When
interrupts are enabled again, the pending interrupt handler is then called.

What are you suggesting we do for this fast way of increasing and
decreasing the priority of tasks?

-- Steve
Joel Fernandes July 17, 2024, 5:16 a.m. UTC | #23
On Tue, Jul 16, 2024 at 7:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > On Fri, 12 Jul 2024 09:44:16 -0700
> > Sean Christopherson <seanjc@google.com> wrote:
> >
> > > > All we need is a notifier that gets called at every VMEXIT.
> > >
> > > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > > which there will be no VM-Exit.
> >
> > No. The speculatively boost was for something else, but slightly
> > related. I guess the ideal there was to have the interrupt coming in
> > boost the vCPU because the interrupt could be waking an RT task. It may
> > still be something needed, but that's not what I'm talking about here.
> >
> > The idea here is when an RT task is scheduled in on the guest, we want
> > to lazily boost it. As long as the vCPU is running on the CPU, we do
> > not need to do anything. If the RT task is scheduled for a very short
> > time, it should not need to call any hypercall. It would set the shared
> > memory to the new priority when the RT task is scheduled, and then put
> > back the lower priority when it is scheduled out and a SCHED_OTHER task
> > is scheduled in.
> >
> > Now if the vCPU gets preempted, it is this moment that we need the host
> > kernel to look at the current priority of the task thread running on
> > the vCPU. If it is an RT task, we need to boost the vCPU to that
> > priority, so that a lower priority host thread does not interrupt it.
>
> I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
> preempted, the host scheduler is already getting "notified", otherwise the vCPU
> would still be scheduled in, i.e. wouldn't have been preempted.

What you're saying is the scheduler should change the priority of the
vCPU thread dynamically. That's really not the job of the scheduler.
The user of the scheduler is what changes the priority of threads, not
the scheduler itself.

Joel
Sean Christopherson July 17, 2024, 2:14 p.m. UTC | #24
On Wed, Jul 17, 2024, Joel Fernandes wrote:
> On Tue, Jul 16, 2024 at 7:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jul 12, 2024, Steven Rostedt wrote:
> > > On Fri, 12 Jul 2024 09:44:16 -0700
> > > Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > > All we need is a notifier that gets called at every VMEXIT.
> > > >
> > > > Why?  The only argument I've seen for needing to hook VM-Exit is so that the
> > > > host can speculatively boost the priority of the vCPU when delivering an IRQ,
> > > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted
> > > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on
> > > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for
> > > > which there will be no VM-Exit.
> > >
> > > No. The speculatively boost was for something else, but slightly
> > > related. I guess the ideal there was to have the interrupt coming in
> > > boost the vCPU because the interrupt could be waking an RT task. It may
> > > still be something needed, but that's not what I'm talking about here.
> > >
> > > The idea here is when an RT task is scheduled in on the guest, we want
> > > to lazily boost it. As long as the vCPU is running on the CPU, we do
> > > not need to do anything. If the RT task is scheduled for a very short
> > > time, it should not need to call any hypercall. It would set the shared
> > > memory to the new priority when the RT task is scheduled, and then put
> > > back the lower priority when it is scheduled out and a SCHED_OTHER task
> > > is scheduled in.
> > >
> > > Now if the vCPU gets preempted, it is this moment that we need the host
> > > kernel to look at the current priority of the task thread running on
> > > the vCPU. If it is an RT task, we need to boost the vCPU to that
> > > priority, so that a lower priority host thread does not interrupt it.
> >
> > I got all that, but I still don't see any need to hook VM-Exit.  If the vCPU gets
> > preempted, the host scheduler is already getting "notified", otherwise the vCPU
> > would still be scheduled in, i.e. wouldn't have been preempted.
> 
> What you're saying is the scheduler should change the priority of the
> vCPU thread dynamically. That's really not the job of the scheduler.
> The user of the scheduler is what changes the priority of threads, not
> the scheduler itself.

No.  If we go the proposed route[*] of adding a data structure that lets userspace
and/or the guest express/adjust the task's priority, then the scheduler simply
checks that data structure when querying the priority of a task.

[*] https://lore.kernel.org/all/ZpFWfInsXQdPJC0V@google.com
Steven Rostedt July 17, 2024, 2:36 p.m. UTC | #25
On Wed, 17 Jul 2024 07:14:59 -0700
Sean Christopherson <seanjc@google.com> wrote:

> > What you're saying is the scheduler should change the priority of the
> > vCPU thread dynamically. That's really not the job of the scheduler.
> > The user of the scheduler is what changes the priority of threads, not
> > the scheduler itself.  
> 
> No.  If we go the proposed route[*] of adding a data structure that lets userspace
> and/or the guest express/adjust the task's priority, then the scheduler simply
> checks that data structure when querying the priority of a task.

The problem with that is the only use case for such a feature is for
vCPUs. There's no use case for a single thread to up and down its
priority. I work a lot in RT applications (well, not as much anymore,
but my career was heavy into it). And I can't see any use case where a
single thread would bounce its priority around. In fact, if I did see
that, I would complain that it was a poorly designed system.

Now for a guest kernel, that's very different. It has to handle things
like priority inheritance and such, where bouncing a thread's (or its
own vCPU thread's) priority most definitely makes sense.

So you are requesting that we add a bad user space interface to allow
lazy priority management from a thread so that we can use it in the
proper use case of a vCPU?

-- Steve
Steven Rostedt July 17, 2024, 2:52 p.m. UTC | #26
On Wed, 17 Jul 2024 10:36:47 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The problem with that is the only use case for such a feature is for
> vCPUS. There's no use case for a single thread to up and down its
> priority. I work a lot in RT applications (well, not as much anymore,
> but my career was heavy into it). And I can't see any use case where a
> single thread would bounce its priority around. In fact, if I did see
> that, I would complain that it was a poorly designed system.
> 
> Now for a guest kernel, that's very different. It has to handle things
> like priority inheritance and such, where bouncing a threads (or its
> own vCPU thread) priority most definitely makes sense.
> 
> So you are requesting that we add a bad user space interface to allow
> lazy priority management from a thread so that we can use it in the
> proper use case of a vCPU?

Now I stated the above thinking you wanted to add a generic interface
for all user space. But perhaps there is a way to get this to be done
by the scheduler itself. But its use case is still only for VMs.

We could possibly add a new sched class that has a dynamic priority.
That is, it can switch between other sched classes. A vCPU thread could
be assigned to this class from inside the kernel (via a virtio device)
where this is not exposed to user space at all. Then the virtio device
would control the mapping of a page between the vCPU thread and the
host kernel. When this task gets scheduled, it can call into the code
that handles the dynamic priority. This will require buy-in from the
scheduler folks.

This could also handle the case of a vCPU being woken up by an
interrupt, as the hooks could be there on the wakeup side as well.

Thoughts?

-- Steve
Steven Rostedt July 17, 2024, 3:20 p.m. UTC | #27
On Wed, 17 Jul 2024 10:52:33 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> We could possibly add a new sched class that has a dynamic priority.

It wouldn't need to be a new sched class. This could work with just a
task_struct flag.

It would only need to be checked in pick_next_task() and
try_to_wake_up(). It would require that the shared memory be allocated
by the host kernel and always be present (unlike rseq). But with this
coming from a virtio device driver, that shouldn't be a problem.

If this flag is set on current, then the first thing that
pick_next_task() should do is to see if it needs to change current's
priority and policy (via a callback to the driver). And then it can
decide what task to pick; if current was boosted, it could very well
be the next task again.

In try_to_wake_up(), if the task waking up has this flag set, it could
boost it via an option set by the virtio device. This would allow it to
preempt the current process if necessary and get on the CPU. Then the
guest would be required to lower its priority if the boost was not
needed.

Hmm, this could work.
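
Roughly, as a toy model (not actual scheduler code; the flag name and the
driver callback are invented, and the priority comparison is simplified):

#include <stdbool.h>

struct toy_task {
        bool pv_dynamic_prio;           /* The proposed task_struct flag. */
        int  prio;                      /* Higher value == higher priority here. */
        void (*pv_refresh_prio)(struct toy_task *t);    /* Callback into the virtio driver. */
};

/* First thing pick_next_task() would do when the flag is set on current. */
static void pick_next_task_hook(struct toy_task *curr)
{
        if (curr->pv_dynamic_prio && curr->pv_refresh_prio)
                curr->pv_refresh_prio(curr);    /* Re-read prio/policy from shared memory. */
        /* ...then pick as usual; a boosted current may simply be picked again. */
}

/* In try_to_wake_up(): optionally boost a flagged task so it can preempt and run. */
static void wake_up_hook(struct toy_task *t, bool boost_on_wakeup, int boost_prio)
{
        if (t->pv_dynamic_prio && boost_on_wakeup && boost_prio > t->prio)
                t->prio = boost_prio;   /* Guest lowers it later if the boost wasn't needed. */
}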

-- Steve
Suleiman Souhlal July 17, 2024, 5:03 p.m. UTC | #28
On Thu, Jul 18, 2024 at 12:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 17 Jul 2024 10:52:33 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > We could possibly add a new sched class that has a dynamic priority.
>
> It wouldn't need to be a new sched class. This could work with just a
> task_struct flag.
>
> It would only need to be checked in pick_next_task() and
> try_to_wake_up(). It would require that the shared memory has to be
> allocated by the host kernel and always present (unlike rseq). But this
> coming from a virtio device driver, that shouldn't be a problem.
>
> If this flag is set on current, then the first thing that
> pick_next_task() should do is to see if it needs to change current's
> priority and policy (via a callback to the driver). And then it can
> decide what task to pick, as if current was boosted, it could very well
> be the next task again.
>
> In try_to_wake_up(), if the task waking up has this flag set, it could
> boost it via an option set by the virtio device. This would allow it to
> preempt the current process if necessary and get on the CPU. Then the
> guest would be required to lower its priority if the boost was not
> needed.
>
> Hmm, this could work.

For what it's worth, I proposed something somewhat conceptually similar before:
https://lore.kernel.org/kvm/CABCjUKBXCFO4-cXAUdbYEKMz4VyvZ5hD-1yP9H7S7eL8XsqO-g@mail.gmail.com/T/

Guest VCPUs would report their preempt_count to the host and the host
would use that to try not to preempt a VCPU that was in a critical
section (with some simple safeguards in case the guest was not well
behaved).
(It worked by adding a "may_preempt" notifier that would get called in
schedule(), whose return value would determine whether we'd try to
schedule away from current or not.)
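
Conceptually it looked something like this toy model (not the actual patch;
the names and the safeguard policy here are just illustrative):

#include <stdbool.h>
#include <stdint.h>

struct toy_vcpu {
        const uint32_t *guest_preempt_count;    /* Guest-written value in a shared page. */
        unsigned int    deferrals;              /* Safeguard against a misbehaving guest. */
};
#define MAX_DEFERRALS 2

/* Called from a schedule()-time notifier: may we preempt this vCPU right now? */
static bool may_preempt(struct toy_vcpu *v)
{
        if (*v->guest_preempt_count && v->deferrals < MAX_DEFERRALS) {
                v->deferrals++;         /* Guest reports a critical section: defer once more. */
                return false;
        }
        v->deferrals = 0;
        return true;
}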

It was VM specific, but the same idea could be made to work for
generic userspace tasks.

-- Suleiman
Joel Fernandes July 17, 2024, 8:57 p.m. UTC | #29
On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 17 Jul 2024 10:52:33 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > We could possibly add a new sched class that has a dynamic priority.
>
> It wouldn't need to be a new sched class. This could work with just a
> task_struct flag.
>
> It would only need to be checked in pick_next_task() and
> try_to_wake_up(). It would require that the shared memory has to be
> allocated by the host kernel and always present (unlike rseq). But this
> coming from a virtio device driver, that shouldn't be a problem.

The problem is it's not only about preemption: if we set the vCPU boosted
to RT class, and another RT task is already running on the same CPU,
then the vCPU thread should get migrated to a different CPU. I don't think
we can do that without doing a proper
sched_setscheduler() / sched_setattr() and letting the scheduler handle
things.  Vineeth's patches were doing that in VMEXIT.

 - Joel
Steven Rostedt July 17, 2024, 9 p.m. UTC | #30
On Wed, 17 Jul 2024 16:57:43 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:

> On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 17 Jul 2024 10:52:33 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >  
> > > We could possibly add a new sched class that has a dynamic priority.  
> >
> > It wouldn't need to be a new sched class. This could work with just a
> > task_struct flag.
> >
> > It would only need to be checked in pick_next_task() and
> > try_to_wake_up(). It would require that the shared memory has to be
> > allocated by the host kernel and always present (unlike rseq). But this
> > coming from a virtio device driver, that shouldn't be a problem.  
> 
> Problem is its not only about preemption, if we set the vCPU boosted
> to RT class, and another RT task is already running on the same CPU,

That can only happen on wakeup (interrupt). The point of lazy priority
changing is that it is only done while the vCPU is running.

-- Steve


> then the vCPU thread should get migrated to different CPU. We can't do
> that I think if we just did it without doing a proper
> sched_setscheduler() / sched_setattr() and let the scheduler handle
> things.  Vineeth's patches was doing that in VMEXIT..
Joel Fernandes July 17, 2024, 9:09 p.m. UTC | #31
On Wed, Jul 17, 2024 at 5:00 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 17 Jul 2024 16:57:43 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > On Wed, 17 Jul 2024 10:52:33 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > > We could possibly add a new sched class that has a dynamic priority.
> > >
> > > It wouldn't need to be a new sched class. This could work with just a
> > > task_struct flag.
> > >
> > > It would only need to be checked in pick_next_task() and
> > > try_to_wake_up(). It would require that the shared memory has to be
> > > allocated by the host kernel and always present (unlike rseq). But this
> > > coming from a virtio device driver, that shouldn't be a problem.
> >
> > Problem is its not only about preemption, if we set the vCPU boosted
> > to RT class, and another RT task is already running on the same CPU,
>
> That can only happen on wakeup (interrupt). As the point of lazy
> priority changing, it is only done when the vCPU is running.

True, but I think it will miss stuff related to load balancing, say if
the "boost" is a higher CFS priority. Then someone has to pull the
running vCPU thread to another CPU, etc. IMO it is better to set the
priority/class externally and let the scheduler deal with it. Let me
think some more about your idea though.

thanks,

 - Joel