Message ID: 20240403140116.3002809-1-vineeth@bitbyteword.org (mailing list archive)
Series: Paravirt Scheduling (Dynamic vcpu priority management)
Sorry I missed sched_ext folks, adding them as well. Thanks, Vineeth On Wed, Apr 3, 2024 at 10:01 AM Vineeth Pillai (Google) <vineeth@bitbyteword.org> wrote: > > Double scheduling is a concern with virtualization hosts where the host > schedules vcpus without knowing what's run by the vcpu and guest schedules > tasks without knowing where the vcpu is physically running. This causes > issues related to latencies, power consumption, resource utilization > etc. An ideal solution would be to have a cooperative scheduling > framework where the guest and host share scheduling related information > and make an educated scheduling decision to optimally handle the > workloads. As a first step, we are taking a stab at reducing latencies > for latency sensitive workloads in the guest. > > v1 RFC[1] was posted in December 2023. The main disagreement was in the > implementation where the patch was making scheduling policy decisions > in kvm and kvm is not the right place to do it. The suggestion was to > move the policy decisions outside of kvm and let kvm only handle the > notifications needed to make the policy decisions. This patch series is > an iterative step towards implementing the feature as a layered > design where the policy could be implemented outside of kvm as a > kernel built-in, a kernel module or a bpf program. > > This design comprises mainly of 4 components: > > - pvsched driver: Implements the scheduling policies. Registers with the > host with a set of callbacks that the hypervisor(kvm) can use to notify > vcpu events that the driver is interested in. The callback will be > passed in the address of shared memory so that the driver can get > scheduling information shared by the guest and also update the > scheduling policies set by the driver. > - kvm component: Selects the pvsched driver for a guest and notifies > the driver via callbacks for events that the driver is interested > in. Also interfaces with the guest in retrieving the shared memory > region for sharing the scheduling information. > - host kernel component: Implements the APIs for: > - pvsched driver for register/unregister to the host kernel, and > - hypervisor for assigning/unassigning driver for guests. > - guest component: Implements a framework for sharing the scheduling > information with the pvsched driver through kvm. > > There is another component that we refer to as pvsched protocol. This > defines the details about shared memory layout, information sharing and > scheduling policy decisions. The protocol need not be part of the kernel > and can be defined separately based on the use case and requirements. > Both guest and the selected pvsched driver need to match the protocol > for the feature to work. Protocol shall be identified by a name and a > possible versioning scheme. Guest will advertise the protocol and then > the hypervisor can assign the driver implementing the protocol if it is > registered in the host kernel. > > This patch series only implements the first 3 components. Guest side > implementation and the protocol framework shall come as a separate > series once we finalize the rest of the design. > > This series also implements a sample bpf program and a kernel-builtin > pvsched driver. They do not do any real stuff now, but are just skeletons > to demonstrate the feature. > > Rebased on 6.8.2.
>
> [1]: https://lwn.net/Articles/955145/
>
> Vineeth Pillai (Google) (5):
>   pvsched: paravirt scheduling framework
>   kvm: Implement the paravirt sched framework for kvm
>   kvm: interface for managing pvsched driver for guest VMs
>   pvsched: bpf support for pvsched
>   selftests/bpf: sample implementation of a bpf pvsched driver.
>
>  Kconfig                                       |   2 +
>  arch/x86/kvm/Kconfig                          |  13 +
>  arch/x86/kvm/x86.c                            |   3 +
>  include/linux/kvm_host.h                      |  32 +++
>  include/linux/pvsched.h                       | 102 +++++++
>  include/uapi/linux/kvm.h                      |   6 +
>  kernel/bpf/bpf_struct_ops_types.h             |   4 +
>  kernel/sysctl.c                               |  27 ++
>  .../testing/selftests/bpf/progs/bpf_pvsched.c |  37 +++
>  virt/Makefile                                 |   2 +-
>  virt/kvm/kvm_main.c                           | 265 ++++++++++++++++++
>  virt/pvsched/Kconfig                          |  12 +
>  virt/pvsched/Makefile                         |   2 +
>  virt/pvsched/pvsched.c                        | 215 ++++++++++++++
>  virt/pvsched/pvsched_bpf.c                    | 141 ++++++++++
>  15 files changed, 862 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/pvsched.h
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_pvsched.c
>  create mode 100644 virt/pvsched/Kconfig
>  create mode 100644 virt/pvsched/Makefile
>  create mode 100644 virt/pvsched/pvsched.c
>  create mode 100644 virt/pvsched/pvsched_bpf.c
>
> --
> 2.40.1
>
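A rough sketch of what the register/unregister callback interface described in the cover letter could look like is below. The names (struct pvsched_vcpu_ops, pvsched_register_vcpu_ops, etc.) are invented for illustration and are not taken from the actual include/linux/pvsched.h in the series; only the shape follows the description: a driver registers callbacks, and KVM invokes them with the address of the per-vcpu shared memory.

    /*
     * Illustrative sketch only -- not the interface from the posted patches.
     * A pvsched driver (built-in, module or bpf-backed) registers a set of
     * callbacks; KVM invokes them on vcpu events and passes the kernel
     * mapping of the per-vcpu shared memory so the driver can read the
     * guest-provided scheduling information and apply its policy.
     */
    struct pvsched_vcpu_ops {
            int  (*pvsched_vcpu_register)(struct pid *vcpu_pid);
            void (*pvsched_vcpu_unregister)(struct pid *vcpu_pid);
            void (*pvsched_vcpu_notify_event)(void *shmem_addr,
                                              struct pid *vcpu_pid, u32 event);
            char name[32];
            struct module *owner;
    };

    /* Host kernel APIs used by pvsched drivers and by the hypervisor (KVM): */
    int  pvsched_register_vcpu_ops(struct pvsched_vcpu_ops *ops);
    int  pvsched_unregister_vcpu_ops(struct pvsched_vcpu_ops *ops);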
On Wed, Apr 03, 2024, Vineeth Pillai (Google) wrote: > Double scheduling is a concern with virtualization hosts where the host > schedules vcpus without knowing whats run by the vcpu and guest schedules > tasks without knowing where the vcpu is physically running. This causes > issues related to latencies, power consumption, resource utilization > etc. An ideal solution would be to have a cooperative scheduling > framework where the guest and host shares scheduling related information > and makes an educated scheduling decision to optimally handle the > workloads. As a first step, we are taking a stab at reducing latencies > for latency sensitive workloads in the guest. > > v1 RFC[1] was posted in December 2023. The main disagreement was in the > implementation where the patch was making scheduling policy decisions > in kvm and kvm is not the right place to do it. The suggestion was to > move the polcy decisions outside of kvm and let kvm only handle the > notifications needed to make the policy decisions. This patch series is > an iterative step towards implementing the feature as a layered > design where the policy could be implemented outside of kvm as a > kernel built-in, a kernel module or a bpf program. > > This design comprises mainly of 4 components: > > - pvsched driver: Implements the scheduling policies. Register with > host with a set of callbacks that hypervisor(kvm) can use to notify > vcpu events that the driver is interested in. The callback will be > passed in the address of shared memory so that the driver can get > scheduling information shared by the guest and also update the > scheduling policies set by the driver. > - kvm component: Selects the pvsched driver for a guest and notifies > the driver via callbacks for events that the driver is interested > in. Also interface with the guest in retreiving the shared memory > region for sharing the scheduling information. > - host kernel component: Implements the APIs for: > - pvsched driver for register/unregister to the host kernel, and > - hypervisor for assingning/unassigning driver for guests. > - guest component: Implements a framework for sharing the scheduling > information with the pvsched driver through kvm. Roughly summarazing an off-list discussion. - Discovery schedulers should be handled outside of KVM and the kernel, e.g. similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest. - "Negotiating" features/hooks should also be handled outside of the kernel, e.g. similar to how VirtIO devices negotiate features between host and guest. - Pushing PV scheduler entities to KVM should either be done through an exported API, e.g. if the scheduler is provided by a separate kernel module, or by a KVM or VM ioctl() (especially if the desire is to have per-VM schedulers). I think those were the main takeaways? Vineeth and Joel, please chime in on anything I've missed or misremembered. The other reason I'm bringing this discussion back on-list is that I (very) briefly discussed this with Paolo, and he pointed out the proposed rseq-based mechanism that would allow userspace to request an extended time slice[*], and that if that landed it would be easy-ish to reuse the interface for KVM's steal_time PV API. I see that you're both on that thread, so presumably you're already aware of the idea, but I wanted to bring it up here to make sure that we aren't trying to design something that's more complex than is needed. 
Specifically, if the guest has a generic way to request an extended time slice (or boost its priority?), would that address your use cases? Or rather, how close does it get you? E.g. the guest will have no way of requesting a larger time slice or boosting priority when an event is _pending_ but not yet receiveed by the guest, but is that actually problematic in practice? [*] https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home
> > This design comprises mainly of 4 components: > > > > - pvsched driver: Implements the scheduling policies. Register with > > host with a set of callbacks that hypervisor(kvm) can use to notify > > vcpu events that the driver is interested in. The callback will be > > passed in the address of shared memory so that the driver can get > > scheduling information shared by the guest and also update the > > scheduling policies set by the driver. > > - kvm component: Selects the pvsched driver for a guest and notifies > > the driver via callbacks for events that the driver is interested > > in. Also interface with the guest in retreiving the shared memory > > region for sharing the scheduling information. > > - host kernel component: Implements the APIs for: > > - pvsched driver for register/unregister to the host kernel, and > > - hypervisor for assingning/unassigning driver for guests. > > - guest component: Implements a framework for sharing the scheduling > > information with the pvsched driver through kvm. > > Roughly summarazing an off-list discussion. > > - Discovery schedulers should be handled outside of KVM and the kernel, e.g. > similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest. > > - "Negotiating" features/hooks should also be handled outside of the kernel, > e.g. similar to how VirtIO devices negotiate features between host and guest. > > - Pushing PV scheduler entities to KVM should either be done through an exported > API, e.g. if the scheduler is provided by a separate kernel module, or by a > KVM or VM ioctl() (especially if the desire is to have per-VM schedulers). > > I think those were the main takeaways? Vineeth and Joel, please chime in on > anything I've missed or misremembered. > Thanks for the brief about the offlist discussion, all the points are captured, just some minor additions. v2 implementation removed the scheduling policies outside of kvm to a separate entity called pvsched driver and could be implemented as a kernel module or bpf program. But the handshake between guest and host to decide on what pvsched driver to attach was still going through kvm. So it was suggested to move this handshake(discovery and negotiation) outside of kvm. The idea is to have a virtual device exposed by the VMM which would take care of the handshake. Guest driver for this device would talk to the device to understand the pvsched details on the host and pass the shared memory details. Once the handshake is completed, the device is responsible for loading the pvsched driver(bpf program or kernel module responsible for implementing the policies). The pvsched driver will register to the trace points exported by kvm and handle the callbacks from then on. The scheduling will be taken care of by the host scheduler, pvsched driver on host is responsible only for setting the policies(placement, priorities etc). With the above approach, the only change in kvm would be the internal tracepoints for pvsched. Host kernel will also be unchanged and all the complexities move to the VMM and the pvsched driver. Guest kernel will have a new driver to talk to the virtual pvsched device and this driver would hook into the guest kernel for passing scheduling information to the host(via tracepoints). 
> The other reason I'm bringing this discussion back on-list is that I (very) briefly > discussed this with Paolo, and he pointed out the proposed rseq-based mechanism > that would allow userspace to request an extended time slice[*], and that if that > landed it would be easy-ish to reuse the interface for KVM's steal_time PV API. > > I see that you're both on that thread, so presumably you're already aware of the > idea, but I wanted to bring it up here to make sure that we aren't trying to > design something that's more complex than is needed. > > Specifically, if the guest has a generic way to request an extended time slice > (or boost its priority?), would that address your use cases? Or rather, how close > does it get you? E.g. the guest will have no way of requesting a larger time > slice or boosting priority when an event is _pending_ but not yet receiveed by > the guest, but is that actually problematic in practice? > > [*] https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home > Thanks for bringing this up. We were also very much interested in this feature and were planning to use the pvmem shared memory instead of rseq framework for guests. The motivation of paravirt scheduling framework was a bit broader than the latency issues and hence we were proposing a bit more complex design. Other than the use case for temporarily extending the time slice of vcpus, we were also looking at vcpu placements on physical cpus, educated decisions that could be made by guest scheduler if it has a picture of host cpu load etc. Having a paravirt mechanism to share scheduling information would benefit in such cases. Once we have this framework setup, the policy implementation on guest and host could be taken care of by other entities like BPF programs, modules or schedulers like sched_ext. We are working on a v3 incorporating the above ideas and would shortly be posting a design RFC soon. Thanks for all the help and inputs on this. Thanks, Vineeth
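As a concrete but invented example of what a pvsched protocol could place in the per-cpu shared memory described in this reply -- the actual layout is deliberately left to the protocol implementation, outside the kernel -- a per-vcpu record might look like this:

    /*
     * Hypothetical per-vcpu shared memory layout for one pvsched protocol.
     * Nothing here is defined by the patches; it only illustrates the kind
     * of information the discussion mentions flowing in each direction.
     */
    struct pvsched_shared_info {
            __u32 protocol_id;       /* negotiated via the virtual pvsched device */
            __u32 protocol_version;

            /* guest -> host */
            __s32 guest_prio;        /* priority/nice of the task the vcpu runs */
            __u32 guest_flags;       /* e.g. "handling irq", "boost requested"   */

            /* host -> guest */
            __u32 host_cpu_load;     /* capacity hint for guest-side placement   */
            __u32 host_flags;        /* e.g. "vcpu was preempted"                */
    };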
> > Roughly summarazing an off-list discussion. > > > > - Discovery schedulers should be handled outside of KVM and the kernel, e.g. > > similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest. > > > > - "Negotiating" features/hooks should also be handled outside of the kernel, > > e.g. similar to how VirtIO devices negotiate features between host and guest. > > > > - Pushing PV scheduler entities to KVM should either be done through an exported > > API, e.g. if the scheduler is provided by a separate kernel module, or by a > > KVM or VM ioctl() (especially if the desire is to have per-VM schedulers). > > > > I think those were the main takeaways? Vineeth and Joel, please chime in on > > anything I've missed or misremembered. > > > Thanks for the brief about the offlist discussion, all the points are > captured, just some minor additions. v2 implementation removed the > scheduling policies outside of kvm to a separate entity called pvsched > driver and could be implemented as a kernel module or bpf program. But > the handshake between guest and host to decide on what pvsched driver > to attach was still going through kvm. So it was suggested to move > this handshake(discovery and negotiation) outside of kvm. The idea is > to have a virtual device exposed by the VMM which would take care of > the handshake. Guest driver for this device would talk to the device > to understand the pvsched details on the host and pass the shared > memory details. Once the handshake is completed, the device is > responsible for loading the pvsched driver(bpf program or kernel > module responsible for implementing the policies). The pvsched driver > will register to the trace points exported by kvm and handle the > callbacks from then on. The scheduling will be taken care of by the > host scheduler, pvsched driver on host is responsible only for setting > the policies(placement, priorities etc). > > With the above approach, the only change in kvm would be the internal > tracepoints for pvsched. Host kernel will also be unchanged and all > the complexities move to the VMM and the pvsched driver. Guest kernel > will have a new driver to talk to the virtual pvsched device and this > driver would hook into the guest kernel for passing scheduling > information to the host(via tracepoints). > Noting down the recent offlist discussion and details of our response. Based on the previous discussions, we had come up with a modified design focusing on minimum kvm changes. The design is as follows: - Guest and host share scheduling information via shared memory region. Details of the layout of the memory region, information shared and actions and policies are defined by the pvsched protocol. And this protocol is implemented by a BPF program or a kernel module. - Host exposes a virtual device(pvsched device to the guest). This device is the mechanism for host and guest for handshake and negotiation to reach a decision on the pvsched protocol to use. The virtual device is implemented in the VMM in userland as it doesn't come in the performance critical path. - Guest loads a pvsched driver during device enumeration. the driver initiates the protocol handshake and negotiation with the host and decides on the protocol. This driver creates a per-cpu shared memory region and shares the GFN with the device in the host. Guest also loads the BPF program that implements the protocol in the guest. 
- Once the VMM has all the information needed(per-cpu shared memory GFN, vcpu task pids etc), it loads the BPF program which implements the protocol on the host. - BPF program on the host registers the trace points in kvm to get callbacks on interested events like VMENTER, VMEXIT, interrupt injection etc. Similarly, the guest BPF program registers tracepoints in the guest kernel for interested events like sched wakeup, sched switch, enqueue, dequeue, irq entry/exit etc. The above design is minimally invasive to the kvm and core kernel and implements the protocol as loadable programs and protocol handshake and negotiation through the virtual device framework. Protocol implementation takes care of information sharing and policy enforcements and scheduler handles the actual scheduling decisions. Sample policy implementation(boosting for latency sensitive workloads as an example) could be included in the kernel for reference. We had an offlist discussion about the above design and a couple of ideas were suggested as an alternative. We had taken an action item to study the alternatives for the feasibility. Rest of the mail lists the use cases(not conclusive) and our feasibility investigations. Existing use cases ------------------------- - A latency sensitive workload on the guest might need more than one time slice to complete, but should not block any higher priority task in the host. In our design, the latency sensitive workload shares its priority requirements to host(RT priority, cfs nice value etc). Host implementation of the protocol sets the priority of the vcpu task accordingly so that the host scheduler can make an educated decision on the next task to run. This makes sure that host processes and vcpu tasks compete fairly for the cpu resource. - Guest should be able to notify the host that it is running a lower priority task so that the host can reschedule it if needed. As mentioned before, the guest shares the priority with the host and the host takes a better scheduling decision. - Proactive vcpu boosting for events like interrupt injection. Depending on the guest for boost request might be too late as the vcpu might not be scheduled to run even after interrupt injection. Host implementation of the protocol boosts the vcpu tasks priority so that it gets a better chance of immediately being scheduled and guest can handle the interrupt with minimal latency. Once the guest is done handling the interrupt, it can notify the host and lower the priority of the vcpu task. - Guests which assign specialized tasks to specific vcpus can share that information with the host so that host can try to avoid colocation of those cpus in a single physical cpu. for eg: there are interrupt pinning use cases where specific cpus are chosen to handle critical interrupts and passing this information to the host could be useful. - Another use case is the sharing of cpu capacity details between guest and host. Sharing the host cpu's load with the guest will enable the guest to schedule latency sensitive tasks on the best possible vcpu. This could be partially achievable by steal time, but steal time is more apparent on busy vcpus. There are workloads which are mostly sleepers, but wake up intermittently to serve short latency sensitive workloads. input event handlers in chrome is one such example. Data from the prototype implementation shows promising improvement in reducing latencies. Data was shared in the v1 cover letter. 
We have not implemented the capacity based placement policies yet, but plan to do that soon and have some real numbers to share. Ideas brought up during offlist discussion ------------------------------------------------------- 1. rseq based timeslice extension mechanism[1] While the rseq based mechanism helps in giving the vcpu task one more time slice, it will not help in the other use cases. We had a chat with Steve and the rseq mechanism was mainly for improving lock contention and would not work best with vcpu boosting considering all the use cases above. RT or high priority tasks in the VM would often need more than one time slice to complete its work and at the same, should not be hurting the host workloads. The goal for the above use cases is not requesting an extra slice, but to modify the priority in such a way that host processes and guest processes get a fair way to compete for cpu resources. This also means that vcpu task can request a lower priority when it is running lower priority tasks in the VM. 2. vDSO approach Regarding the vDSO approach, we had a look at that and feel that without a major redesign of vDSO, it might be difficult to achieve the requirements. vDSO is currently implemented as a shared read-only memory region with the processes. For this to work with virtualization, we would need to map a similar region to the guest and it has to be read-write. This is more or less what we are also proposing, but with minimal changes in the core kernel. With the current design, the shared memory region would be the responsibility of the virtual pvsched device framework. Sorry for the long mail. Please have a look and let us know your thoughts :-) Thanks, [1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/
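To make the "BPF program registers to kvm tracepoints" step above more concrete, a minimal host-side sketch could look like the following. The kvm_exit tracepoint is existing KVM infrastructure; everything else (the map keyed by the vcpu task's TID, the notion of a "requested priority" filled in from the shared page, and what a real policy would then do with it) is an assumption made only for illustration.

    // SPDX-License-Identifier: GPL-2.0
    /* Hedged sketch of a host-side pvsched policy as a BPF program. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u32);     /* host TID of the task doing KVM_RUN */
            __type(value, __s32);   /* priority requested via the shared page */
    } vcpu_prio_req SEC(".maps");

    SEC("tp_btf/kvm_exit")
    int BPF_PROG(on_kvm_exit)
    {
            __u32 tid = (__u32)bpf_get_current_pid_tgid();
            __s32 *prio = bpf_map_lookup_elem(&vcpu_prio_req, &tid);

            if (prio) {
                    /* A real protocol implementation would apply or forward
                     * the policy here (e.g. to sched_ext or a host agent). */
            }
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";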
On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote: > > > Roughly summarazing an off-list discussion. > > > > > > - Discovery schedulers should be handled outside of KVM and the kernel, e.g. > > > similar to how userspace uses PCI, VMBUS, etc. to enumerate devices to the guest. > > > > > > - "Negotiating" features/hooks should also be handled outside of the kernel, > > > e.g. similar to how VirtIO devices negotiate features between host and guest. > > > > > > - Pushing PV scheduler entities to KVM should either be done through an exported > > > API, e.g. if the scheduler is provided by a separate kernel module, or by a > > > KVM or VM ioctl() (especially if the desire is to have per-VM schedulers). > > > > > > I think those were the main takeaways? Vineeth and Joel, please chime in on > > > anything I've missed or misremembered. > > > > > Thanks for the brief about the offlist discussion, all the points are > > captured, just some minor additions. v2 implementation removed the > > scheduling policies outside of kvm to a separate entity called pvsched > > driver and could be implemented as a kernel module or bpf program. But > > the handshake between guest and host to decide on what pvsched driver > > to attach was still going through kvm. So it was suggested to move > > this handshake(discovery and negotiation) outside of kvm. The idea is > > to have a virtual device exposed by the VMM which would take care of > > the handshake. Guest driver for this device would talk to the device > > to understand the pvsched details on the host and pass the shared > > memory details. Once the handshake is completed, the device is > > responsible for loading the pvsched driver(bpf program or kernel > > module responsible for implementing the policies). The pvsched driver > > will register to the trace points exported by kvm and handle the > > callbacks from then on. The scheduling will be taken care of by the > > host scheduler, pvsched driver on host is responsible only for setting > > the policies(placement, priorities etc). > > > > With the above approach, the only change in kvm would be the internal > > tracepoints for pvsched. Host kernel will also be unchanged and all > > the complexities move to the VMM and the pvsched driver. Guest kernel > > will have a new driver to talk to the virtual pvsched device and this > > driver would hook into the guest kernel for passing scheduling > > information to the host(via tracepoints). > > > Noting down the recent offlist discussion and details of our response. > > Based on the previous discussions, we had come up with a modified > design focusing on minimum kvm changes. The design is as follows: > - Guest and host share scheduling information via shared memory > region. Details of the layout of the memory region, information shared > and actions and policies are defined by the pvsched protocol. And this > protocol is implemented by a BPF program or a kernel module. > - Host exposes a virtual device(pvsched device to the guest). This > device is the mechanism for host and guest for handshake and > negotiation to reach a decision on the pvsched protocol to use. The > virtual device is implemented in the VMM in userland as it doesn't > come in the performance critical path. > - Guest loads a pvsched driver during device enumeration. the driver > initiates the protocol handshake and negotiation with the host and > decides on the protocol. This driver creates a per-cpu shared memory > region and shares the GFN with the device in the host. 
Guest also > loads the BPF program that implements the protocol in the guest. > - Once the VMM has all the information needed(per-cpu shared memory > GFN, vcpu task pids etc), it loads the BPF program which implements > the protocol on the host. > - BPF program on the host registers the trace points in kvm to get > callbacks on interested events like VMENTER, VMEXIT, interrupt > injection etc. Similarly, the guest BPF program registers tracepoints > in the guest kernel for interested events like sched wakeup, sched > switch, enqueue, dequeue, irq entry/exit etc. > > The above design is minimally invasive to the kvm and core kernel and > implements the protocol as loadable programs and protocol handshake > and negotiation through the virtual device framework. Protocol > implementation takes care of information sharing and policy > enforcements and scheduler handles the actual scheduling decisions. > Sample policy implementation(boosting for latency sensitive workloads > as an example) could be included in the kernel for reference. > > We had an offlist discussion about the above design and a couple of > ideas were suggested as an alternative. We had taken an action item to > study the alternatives for the feasibility. Rest of the mail lists the > use cases(not conclusive) and our feasibility investigations. > > Existing use cases > ------------------------- > > - A latency sensitive workload on the guest might need more than one > time slice to complete, but should not block any higher priority task > in the host. In our design, the latency sensitive workload shares its > priority requirements to host(RT priority, cfs nice value etc). Host > implementation of the protocol sets the priority of the vcpu task > accordingly so that the host scheduler can make an educated decision > on the next task to run. This makes sure that host processes and vcpu > tasks compete fairly for the cpu resource. > - Guest should be able to notify the host that it is running a lower > priority task so that the host can reschedule it if needed. As > mentioned before, the guest shares the priority with the host and the > host takes a better scheduling decision. > - Proactive vcpu boosting for events like interrupt injection. > Depending on the guest for boost request might be too late as the vcpu > might not be scheduled to run even after interrupt injection. Host > implementation of the protocol boosts the vcpu tasks priority so that > it gets a better chance of immediately being scheduled and guest can > handle the interrupt with minimal latency. Once the guest is done > handling the interrupt, it can notify the host and lower the priority > of the vcpu task. > - Guests which assign specialized tasks to specific vcpus can share > that information with the host so that host can try to avoid > colocation of those cpus in a single physical cpu. for eg: there are > interrupt pinning use cases where specific cpus are chosen to handle > critical interrupts and passing this information to the host could be > useful. > - Another use case is the sharing of cpu capacity details between > guest and host. Sharing the host cpu's load with the guest will enable > the guest to schedule latency sensitive tasks on the best possible > vcpu. This could be partially achievable by steal time, but steal time > is more apparent on busy vcpus. There are workloads which are mostly > sleepers, but wake up intermittently to serve short latency sensitive > workloads. input event handlers in chrome is one such example. 
> > Data from the prototype implementation shows promising improvement in > reducing latencies. Data was shared in the v1 cover letter. We have > not implemented the capacity based placement policies yet, but plan to > do that soon and have some real numbers to share. > > Ideas brought up during offlist discussion > ------------------------------------------------------- > > 1. rseq based timeslice extension mechanism[1] > > While the rseq based mechanism helps in giving the vcpu task one more > time slice, it will not help in the other use cases. We had a chat > with Steve and the rseq mechanism was mainly for improving lock > contention and would not work best with vcpu boosting considering all > the use cases above. RT or high priority tasks in the VM would often > need more than one time slice to complete its work and at the same, > should not be hurting the host workloads. The goal for the above use > cases is not requesting an extra slice, but to modify the priority in > such a way that host processes and guest processes get a fair way to > compete for cpu resources. This also means that vcpu task can request > a lower priority when it is running lower priority tasks in the VM. I was looking at the rseq on request from the KVM call, however it does not make sense to me yet how to expose the rseq area via the Guest VA to the host kernel. rseq is for userspace to kernel, not VM to kernel. Steven Rostedt said as much as well, thoughts? Add Mathieu as well. This idea seems to suffer from the same vDSO over-engineering below, rseq does not seem to fit. Steven Rostedt told me, what we instead need is a tracepoint callback in a driver, that does the boosting. - Joel > > 2. vDSO approach > Regarding the vDSO approach, we had a look at that and feel that > without a major redesign of vDSO, it might be difficult to achieve the > requirements. vDSO is currently implemented as a shared read-only > memory region with the processes. For this to work with > virtualization, we would need to map a similar region to the guest and > it has to be read-write. This is more or less what we are also > proposing, but with minimal changes in the core kernel. With the > current design, the shared memory region would be the responsibility > of the virtual pvsched device framework. > > Sorry for the long mail. Please have a look and let us know your thoughts :-) > > Thanks, > > [1]: https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/
On 2024-07-12 08:57, Joel Fernandes wrote: > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote: [...] >> Existing use cases >> ------------------------- >> >> - A latency sensitive workload on the guest might need more than one >> time slice to complete, but should not block any higher priority task >> in the host. In our design, the latency sensitive workload shares its >> priority requirements to host(RT priority, cfs nice value etc). Host >> implementation of the protocol sets the priority of the vcpu task >> accordingly so that the host scheduler can make an educated decision >> on the next task to run. This makes sure that host processes and vcpu >> tasks compete fairly for the cpu resource. AFAIU, the information you need to convey to achieve this is the priority of the task within the guest. This information need to reach the host scheduler to make informed decision. One thing that is unclear about this is what is the acceptable overhead/latency to push this information from guest to host ? Is an hypercall OK or does it need to be exchanged over a memory mapping shared between guest and host ? Hypercalls provide simple ABIs across guest/host, and they allow the guest to immediately notify the host (similar to an interrupt). Shared memory mapping will require a carefully crafted ABI layout, and will only allow the host to use the information provided when the host runs. Therefore, if the choice is to share this information only through shared memory, the host scheduler will only be able to read it when it runs, so in hypercall, interrupt, and so on. >> - Guest should be able to notify the host that it is running a lower >> priority task so that the host can reschedule it if needed. As >> mentioned before, the guest shares the priority with the host and the >> host takes a better scheduling decision. It is unclear to me whether this information needs to be "pushed" from guest to host (e.g. hypercall) in a way that allows the host to immediately act on this information, or if it is OK to have the host read this information when its scheduler happens to run. >> - Proactive vcpu boosting for events like interrupt injection. >> Depending on the guest for boost request might be too late as the vcpu >> might not be scheduled to run even after interrupt injection. Host >> implementation of the protocol boosts the vcpu tasks priority so that >> it gets a better chance of immediately being scheduled and guest can >> handle the interrupt with minimal latency. Once the guest is done >> handling the interrupt, it can notify the host and lower the priority >> of the vcpu task. This appears to be a scenario where the host sets a "high priority", and the guest clears it when it is done with the irq handler. I guess it can be done either ways (hypercall or shared memory), but the choice would depend on the parameters identified above: acceptable overhead vs acceptable latency to inform the host scheduler. >> - Guests which assign specialized tasks to specific vcpus can share >> that information with the host so that host can try to avoid >> colocation of those cpus in a single physical cpu. for eg: there are >> interrupt pinning use cases where specific cpus are chosen to handle >> critical interrupts and passing this information to the host could be >> useful. How frequently is this topology expected to change ? Is it something that is set once when the guest starts and then is fixed ? How often it changes will likely affect the tradeoffs here. 
>> - Another use case is the sharing of cpu capacity details between >> guest and host. Sharing the host cpu's load with the guest will enable >> the guest to schedule latency sensitive tasks on the best possible >> vcpu. This could be partially achievable by steal time, but steal time >> is more apparent on busy vcpus. There are workloads which are mostly >> sleepers, but wake up intermittently to serve short latency sensitive >> workloads. input event handlers in chrome is one such example. OK so for this use-case information goes the other way around: from host to guest. Here the shared mapping seems better than polling the state through an hypercall. >> >> Data from the prototype implementation shows promising improvement in >> reducing latencies. Data was shared in the v1 cover letter. We have >> not implemented the capacity based placement policies yet, but plan to >> do that soon and have some real numbers to share. >> >> Ideas brought up during offlist discussion >> ------------------------------------------------------- >> >> 1. rseq based timeslice extension mechanism[1] >> >> While the rseq based mechanism helps in giving the vcpu task one more >> time slice, it will not help in the other use cases. We had a chat >> with Steve and the rseq mechanism was mainly for improving lock >> contention and would not work best with vcpu boosting considering all >> the use cases above. RT or high priority tasks in the VM would often >> need more than one time slice to complete its work and at the same, >> should not be hurting the host workloads. The goal for the above use >> cases is not requesting an extra slice, but to modify the priority in >> such a way that host processes and guest processes get a fair way to >> compete for cpu resources. This also means that vcpu task can request >> a lower priority when it is running lower priority tasks in the VM. > > I was looking at the rseq on request from the KVM call, however it does not > make sense to me yet how to expose the rseq area via the Guest VA to the host > kernel. rseq is for userspace to kernel, not VM to kernel. > > Steven Rostedt said as much as well, thoughts? Add Mathieu as well. I'm not sure that rseq would help at all here, but I think we may want to borrow concepts of data sitting in shared memory across privilege levels and apply them to VMs. If some of the ideas end up being useful *outside* of the context of VMs, then I'd be willing to consider adding fields to rseq. But as long as it is VM-specific, I suspect you'd be better with dedicated per-vcpu pages which you can safely share across host/guest kernels. > > This idea seems to suffer from the same vDSO over-engineering below, rseq > does not seem to fit. > > Steven Rostedt told me, what we instead need is a tracepoint callback in a > driver, that does the boosting. I utterly dislike changing the system behavior through tracepoints. They were designed to observe the system, not modify its behavior. If people start abusing them, then subsystem maintainers will stop adding them. Please don't do that. Add a notifier or think about integrating what you are planning to add into the driver instead. Thanks, Mathieu
On Fri, Jul 12, 2024, Mathieu Desnoyers wrote: > On 2024-07-12 08:57, Joel Fernandes wrote: > > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote: > [...] > > > Existing use cases > > > ------------------------- > > > > > > - A latency sensitive workload on the guest might need more than one > > > time slice to complete, but should not block any higher priority task > > > in the host. In our design, the latency sensitive workload shares its > > > priority requirements to host(RT priority, cfs nice value etc). Host > > > implementation of the protocol sets the priority of the vcpu task > > > accordingly so that the host scheduler can make an educated decision > > > on the next task to run. This makes sure that host processes and vcpu > > > tasks compete fairly for the cpu resource. > > AFAIU, the information you need to convey to achieve this is the priority > of the task within the guest. This information need to reach the host > scheduler to make informed decision. > > One thing that is unclear about this is what is the acceptable > overhead/latency to push this information from guest to host ? > Is an hypercall OK or does it need to be exchanged over a memory > mapping shared between guest and host ? > > Hypercalls provide simple ABIs across guest/host, and they allow > the guest to immediately notify the host (similar to an interrupt). Hypercalls have myriad problems. They require a VM-Exit, which largely defeats the purpose of boosting the vCPU priority for performance reasons. They don't allow for delegation as there's no way for the hypervisor to know if a hypercall from guest userspace should be allowed, versus anything memory based where the ability for guest userspace to access the memory demonstrates permission (else the guest kernel wouldn't have mapped the memory into userspace). > > > Ideas brought up during offlist discussion > > > ------------------------------------------------------- > > > > > > 1. rseq based timeslice extension mechanism[1] > > > > > > While the rseq based mechanism helps in giving the vcpu task one more > > > time slice, it will not help in the other use cases. We had a chat > > > with Steve and the rseq mechanism was mainly for improving lock > > > contention and would not work best with vcpu boosting considering all > > > the use cases above. RT or high priority tasks in the VM would often > > > need more than one time slice to complete its work and at the same, > > > should not be hurting the host workloads. The goal for the above use > > > cases is not requesting an extra slice, but to modify the priority in > > > such a way that host processes and guest processes get a fair way to > > > compete for cpu resources. This also means that vcpu task can request > > > a lower priority when it is running lower priority tasks in the VM. Then figure out a way to let userspace boot a task's priority without needing a syscall. vCPUs are not directly schedulable entities, the task doing KVM_RUN on the vCPU fd is what the scheduler sees. Any scheduling enhancement that benefits vCPUs by definition can benefit userspace tasks. > > I was looking at the rseq on request from the KVM call, however it does not > > make sense to me yet how to expose the rseq area via the Guest VA to the host > > kernel. rseq is for userspace to kernel, not VM to kernel. Any memory that is exposed to host userspace can be exposed to the guest. 
Things like this are implemented via "overlay" pages, where the guest asks host userspace to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the address of the page containing the rseq structure associated with the vCPU (in pretty much every modern VMM, each vCPU has a dedicated task/thread). A that point, the vCPU can read/write the rseq structure directly. The reason us KVM folks are pushing y'all towards something like rseq is that (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU is actually just priority boosting a task. So rather than invent something virtualization specific, invent a mechanism for priority boosting from userspace without a syscall, and then extend it to the virtualization use case. > > Steven Rostedt said as much as well, thoughts? Add Mathieu as well. > > I'm not sure that rseq would help at all here, but I think we may want to > borrow concepts of data sitting in shared memory across privilege levels > and apply them to VMs. > > If some of the ideas end up being useful *outside* of the context of VMs, Modulo the assertion above that this is is about boosting priority instead of requesting an extended time slice, this is essentially the same thing as the "delay resched" discussion[*]. The only difference is that the vCPU is in a critical section, e.q. IRQ handler, versus the userspace task being in a critical section. [*] https://lore.kernel.org/all/20231025054219.1acaa3dd@gandalf.local.home > then I'd be willing to consider adding fields to rseq. But as long as it is > VM-specific, I suspect you'd be better with dedicated per-vcpu pages which > you can safely share across host/guest kernels.
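For reference, the "overlay" memslot described here is plain KVM userspace API; a VMM-side sketch might look like this. The GPA, slot number and the choice to expose the page holding the vCPU thread's rseq/shared structure are illustrative assumptions, while KVM_SET_USER_MEMORY_REGION itself is standard KVM API.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    /* Map an existing host page (e.g. the page containing the per-vCPU
     * shared structure) into the guest at a "magic" GPA via an overlay
     * memslot. */
    static int map_overlay_page(int vm_fd, uint32_t slot, uint64_t gpa,
                                void *host_page)
    {
            struct kvm_userspace_memory_region region = {
                    .slot            = slot,
                    .flags           = 0,
                    .guest_phys_addr = gpa,
                    .memory_size     = 4096,                /* one page */
                    .userspace_addr  = (uint64_t)host_page, /* page-aligned host VA */
            };

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }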
On 2024-07-12 10:48, Sean Christopherson wrote: > On Fri, Jul 12, 2024, Mathieu Desnoyers wrote: >> On 2024-07-12 08:57, Joel Fernandes wrote: >>> On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote: >> [...] >>>> Existing use cases >>>> ------------------------- >>>> >>>> - A latency sensitive workload on the guest might need more than one >>>> time slice to complete, but should not block any higher priority task >>>> in the host. In our design, the latency sensitive workload shares its >>>> priority requirements to host(RT priority, cfs nice value etc). Host >>>> implementation of the protocol sets the priority of the vcpu task >>>> accordingly so that the host scheduler can make an educated decision >>>> on the next task to run. This makes sure that host processes and vcpu >>>> tasks compete fairly for the cpu resource. >> >> AFAIU, the information you need to convey to achieve this is the priority >> of the task within the guest. This information need to reach the host >> scheduler to make informed decision. >> >> One thing that is unclear about this is what is the acceptable >> overhead/latency to push this information from guest to host ? >> Is an hypercall OK or does it need to be exchanged over a memory >> mapping shared between guest and host ? >> >> Hypercalls provide simple ABIs across guest/host, and they allow >> the guest to immediately notify the host (similar to an interrupt). > > Hypercalls have myriad problems. They require a VM-Exit, which largely defeats > the purpose of boosting the vCPU priority for performance reasons. They don't > allow for delegation as there's no way for the hypervisor to know if a hypercall > from guest userspace should be allowed, versus anything memory based where the > ability for guest userspace to access the memory demonstrates permission (else > the guest kernel wouldn't have mapped the memory into userspace). OK, this answers my question above: the overhead of the hypercall pretty much defeats the purpose of this priority boosting. > >>>> Ideas brought up during offlist discussion >>>> ------------------------------------------------------- >>>> >>>> 1. rseq based timeslice extension mechanism[1] >>>> >>>> While the rseq based mechanism helps in giving the vcpu task one more >>>> time slice, it will not help in the other use cases. We had a chat >>>> with Steve and the rseq mechanism was mainly for improving lock >>>> contention and would not work best with vcpu boosting considering all >>>> the use cases above. RT or high priority tasks in the VM would often >>>> need more than one time slice to complete its work and at the same, >>>> should not be hurting the host workloads. The goal for the above use >>>> cases is not requesting an extra slice, but to modify the priority in >>>> such a way that host processes and guest processes get a fair way to >>>> compete for cpu resources. This also means that vcpu task can request >>>> a lower priority when it is running lower priority tasks in the VM. > > Then figure out a way to let userspace boot a task's priority without needing a > syscall. vCPUs are not directly schedulable entities, the task doing KVM_RUN > on the vCPU fd is what the scheduler sees. Any scheduling enhancement that > benefits vCPUs by definition can benefit userspace tasks. Yes. > >>> I was looking at the rseq on request from the KVM call, however it does not >>> make sense to me yet how to expose the rseq area via the Guest VA to the host >>> kernel. rseq is for userspace to kernel, not VM to kernel. 
> > Any memory that is exposed to host userspace can be exposed to the guest. Things > like this are implemented via "overlay" pages, where the guest asks host userspace > to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the > address of the page containing the rseq structure associated with the vCPU (in > pretty much every modern VMM, each vCPU has a dedicated task/thread). > > A that point, the vCPU can read/write the rseq structure directly. This helps me understand what you are trying to achieve. I disagree with some aspects of the design you present above: mainly the lack of isolation between the guest kernel and the host task doing the KVM_RUN. We do not want to let the guest kernel store to rseq fields that would result in getting the host task killed (e.g. a bogus rseq_cs pointer). But this is something we can improve upon once we understand what we are trying to achieve. > > The reason us KVM folks are pushing y'all towards something like rseq is that > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU > is actually just priority boosting a task. So rather than invent something > virtualization specific, invent a mechanism for priority boosting from userspace > without a syscall, and then extend it to the virtualization use case. > [...] OK, so how about we expose "offsets" tuning the base values ? - The task doing KVM_RUN, just like any other task, has its "priority" value as set by setpriority(2). - We introduce two new fields in the per-thread struct rseq, which is mapped in the host task doing KVM_RUN and readable from the scheduler: - __s32 prio_offset; /* Priority offset to apply on the current task priority. */ - __u64 vcpu_sched; /* Pointer to a struct vcpu_sched in user-space */ vcpu_sched would be a userspace pointer to a new vcpu_sched structure, which would be typically NULL except for tasks doing KVM_RUN. This would sit in its own pages per vcpu, which takes care of isolation between guest kernel and host process. Those would be RW by the guest kernel as well and contain e.g.: struct vcpu_sched { __u32 len; /* Length of active fields. */ __s32 prio_offset; __s32 cpu_capacity_offset; [...] }; So when the host kernel try to calculate the effective priority of a task doing KVM_RUN, it would basically start from its current priority, and offset by (rseq->prio_offset + rseq->vcpu_sched->prio_offset). The cpu_capacity_offset would be populated by the host kernel and read by the guest kernel scheduler for scheduling/migration decisions. I'm certainly missing details about how priority offsets should be bounded for given tasks. This could be an extension to setrlimit(2). Thoughts ? Thanks, Mathieu
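Restating the proposal above as a sketch, purely to make the offset arithmetic explicit (none of these fields exist today, and how the offsets get bounded, e.g. via setrlimit(2), is exactly as open as in the mail):

    struct vcpu_sched {                 /* per-vcpu page(s), RW for the guest */
            __u32 len;                  /* length of active fields            */
            __s32 prio_offset;          /* written by the guest kernel        */
            __s32 cpu_capacity_offset;  /* written by the host, read by guest */
    };

    /* Proposed additions to the per-thread struct rseq (host side):
     *   __s32 prio_offset;   -- offset applied to the current task priority
     *   __u64 vcpu_sched;    -- user pointer, NULL except for KVM_RUN tasks
     *
     * Effective priority the host scheduler would compute for a vCPU task:
     *   effective_prio = task_prio + rseq->prio_offset
     *                              + rseq->vcpu_sched->prio_offset;
     */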
On Fri, Jul 12, 2024, Mathieu Desnoyers wrote: > On 2024-07-12 10:48, Sean Christopherson wrote: > > > > I was looking at the rseq on request from the KVM call, however it does not > > > > make sense to me yet how to expose the rseq area via the Guest VA to the host > > > > kernel. rseq is for userspace to kernel, not VM to kernel. > > > > Any memory that is exposed to host userspace can be exposed to the guest. Things > > like this are implemented via "overlay" pages, where the guest asks host userspace > > to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a > > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the > > address of the page containing the rseq structure associated with the vCPU (in > > pretty much every modern VMM, each vCPU has a dedicated task/thread). > > > > A that point, the vCPU can read/write the rseq structure directly. > > This helps me understand what you are trying to achieve. I disagree with > some aspects of the design you present above: mainly the lack of > isolation between the guest kernel and the host task doing the KVM_RUN. > We do not want to let the guest kernel store to rseq fields that would > result in getting the host task killed (e.g. a bogus rseq_cs pointer). Yeah, exposing the full rseq structure to the guest is probably a terrible idea. The above isn't intended to be a design, the goal is just to illustrate how an rseq-like mechanism can be extended to the guest without needing virtualization specific ABI and without needing new KVM functionality. > But this is something we can improve upon once we understand what we > are trying to achieve. > > > > > The reason us KVM folks are pushing y'all towards something like rseq is that > > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU > > is actually just priority boosting a task. So rather than invent something > > virtualization specific, invent a mechanism for priority boosting from userspace > > without a syscall, and then extend it to the virtualization use case. > > > [...] > > OK, so how about we expose "offsets" tuning the base values ? > > - The task doing KVM_RUN, just like any other task, has its "priority" > value as set by setpriority(2). > > - We introduce two new fields in the per-thread struct rseq, which is > mapped in the host task doing KVM_RUN and readable from the scheduler: > > - __s32 prio_offset; /* Priority offset to apply on the current task priority. */ > > - __u64 vcpu_sched; /* Pointer to a struct vcpu_sched in user-space */ Ideally, there won't be a virtualization specific structure. A vCPU specific field might make sense (or it might not), but I really want to avoid defining a structure that is unique to virtualization. E.g. a userspace doing M:N scheduling can likely benefit from any capacity hooks/information that would benefit a guest scheduler. I.e. rather than a vcpu_sched structure, have a user_sched structure (or whatever name makes sense), and then have two struct pointers in rseq. Though I'm skeptical that having two structs in play would be necessary or sane. E.g. if both userspace and guest can adjust priority, then they'll need to coordinate in order to avoid unexpected results. I can definitely see wanting to let the userspace VMM bound the priority of a vCPU, but that should be a relatively static decision, i.e. can be done via syscall or something similarly "slow". 
> vcpu_sched would be a userspace pointer to a new vcpu_sched structure, > which would be typically NULL except for tasks doing KVM_RUN. This would > sit in its own pages per vcpu, which takes care of isolation between guest > kernel and host process. Those would be RW by the guest kernel as > well and contain e.g.: > > struct vcpu_sched { > __u32 len; /* Length of active fields. */ > > __s32 prio_offset; > __s32 cpu_capacity_offset; > [...] > }; > > So when the host kernel try to calculate the effective priority of a task > doing KVM_RUN, it would basically start from its current priority, and offset > by (rseq->prio_offset + rseq->vcpu_sched->prio_offset). > > The cpu_capacity_offset would be populated by the host kernel and read by the > guest kernel scheduler for scheduling/migration decisions. > > I'm certainly missing details about how priority offsets should be bounded for > given tasks. This could be an extension to setrlimit(2). > > Thoughts ? > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Fri, 12 Jul 2024 10:09:03 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a > > driver, that does the boosting. > > I utterly dislike changing the system behavior through tracepoints. They were > designed to observe the system, not modify its behavior. If people start abusing > them, then subsystem maintainers will stop adding them. Please don't do that. > Add a notifier or think about integrating what you are planning to add into the > driver instead. I tend to agree that a notifier would be much better than using tracepoints, but then I also think eBPF has already let that cat out of the bag. :-p All we need is a notifier that gets called at every VMEXIT. The main issue that this is trying to solve is to boost the priority of the guest without making the hypercall, so that it can quickly react (lower the latency of reaction to an event). Now when the task is unboosted, there's no avoiding of the hypercall as there's no other way to tell the host that this vCPU should not be running at a higher priority (the high priority may prevent schedules, or even checking the new prio in the shared memory). If there's a way to have a shared memory, via virtio or whatever, and any notifier that gets called at any VMEXIT, then this is trivial to implement. -- Steve
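A minimal sketch of what such a "notifier at every VMEXIT" could look like from the pvsched driver's side is below. The kvm_vmexit_notifier_list chain is hypothetical (it would be the one new KVM-side hook); the notifier_block machinery itself is existing kernel infrastructure.

    #include <linux/notifier.h>

    /* Hypothetical: would live in KVM and be invoked on every VM-exit,
     * passing the vCPU (or its task) as the data pointer. */
    ATOMIC_NOTIFIER_HEAD(kvm_vmexit_notifier_list);

    static int pvsched_vmexit_cb(struct notifier_block *nb, unsigned long action,
                                 void *data /* e.g. struct kvm_vcpu * */)
    {
            /* Read the per-vCPU shared page and adjust the vCPU task's
             * priority as the negotiated protocol dictates. */
            return NOTIFY_OK;
    }

    static struct notifier_block pvsched_nb = {
            .notifier_call = pvsched_vmexit_cb,
    };

    static int __init pvsched_boost_init(void)
    {
            return atomic_notifier_chain_register(&kvm_vmexit_notifier_list,
                                                  &pvsched_nb);
    }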
On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2024-07-12 08:57, Joel Fernandes wrote: > > On Mon, Jun 24, 2024 at 07:01:19AM -0400, Vineeth Remanan Pillai wrote: > [...] > >> Existing use cases > >> ------------------------- > >> > >> - A latency sensitive workload on the guest might need more than one > >> time slice to complete, but should not block any higher priority task > >> in the host. In our design, the latency sensitive workload shares its > >> priority requirements to host(RT priority, cfs nice value etc). Host > >> implementation of the protocol sets the priority of the vcpu task > >> accordingly so that the host scheduler can make an educated decision > >> on the next task to run. This makes sure that host processes and vcpu > >> tasks compete fairly for the cpu resource. > > AFAIU, the information you need to convey to achieve this is the priority > of the task within the guest. This information need to reach the host > scheduler to make informed decision. > > One thing that is unclear about this is what is the acceptable > overhead/latency to push this information from guest to host ? > Is an hypercall OK or does it need to be exchanged over a memory > mapping shared between guest and host ? Shared memory for the boost (Can do it later during host preemption). But for unboost, we possibly need a hypercall in addition to it as well. > > Hypercalls provide simple ABIs across guest/host, and they allow > the guest to immediately notify the host (similar to an interrupt). > > Shared memory mapping will require a carefully crafted ABI layout, > and will only allow the host to use the information provided when > the host runs. Therefore, if the choice is to share this information > only through shared memory, the host scheduler will only be able to > read it when it runs, so in hypercall, interrupt, and so on. The initial idea was to handle the details/format/allocation of the shared memory out-of-band in a driver, but then later the rseq idea came up. > >> - Guest should be able to notify the host that it is running a lower > >> priority task so that the host can reschedule it if needed. As > >> mentioned before, the guest shares the priority with the host and the > >> host takes a better scheduling decision. > > It is unclear to me whether this information needs to be "pushed" > from guest to host (e.g. hypercall) in a way that allows the host > to immediately act on this information, or if it is OK to have the > host read this information when its scheduler happens to run. For boosting, there is no need to immediately push. Only on preemption. > >> - Proactive vcpu boosting for events like interrupt injection. > >> Depending on the guest for boost request might be too late as the vcpu > >> might not be scheduled to run even after interrupt injection. Host > >> implementation of the protocol boosts the vcpu tasks priority so that > >> it gets a better chance of immediately being scheduled and guest can > >> handle the interrupt with minimal latency. Once the guest is done > >> handling the interrupt, it can notify the host and lower the priority > >> of the vcpu task. > > This appears to be a scenario where the host sets a "high priority", and > the guest clears it when it is done with the irq handler. I guess it can > be done either ways (hypercall or shared memory), but the choice would > depend on the parameters identified above: acceptable overhead vs acceptable > latency to inform the host scheduler. 
Yes, we have found ways to reduce/make fewer hypercalls on unboost. > >> - Guests which assign specialized tasks to specific vcpus can share > >> that information with the host so that host can try to avoid > >> colocation of those cpus in a single physical cpu. for eg: there are > >> interrupt pinning use cases where specific cpus are chosen to handle > >> critical interrupts and passing this information to the host could be > >> useful. > > How frequently is this topology expected to change ? Is it something that > is set once when the guest starts and then is fixed ? How often it changes > will likely affect the tradeoffs here. Yes, will be fixed. > >> - Another use case is the sharing of cpu capacity details between > >> guest and host. Sharing the host cpu's load with the guest will enable > >> the guest to schedule latency sensitive tasks on the best possible > >> vcpu. This could be partially achievable by steal time, but steal time > >> is more apparent on busy vcpus. There are workloads which are mostly > >> sleepers, but wake up intermittently to serve short latency sensitive > >> workloads. input event handlers in chrome is one such example. > > OK so for this use-case information goes the other way around: from host > to guest. Here the shared mapping seems better than polling the state > through an hypercall. Yes, FWIW this particular part is for future and not initially required per-se. > >> Data from the prototype implementation shows promising improvement in > >> reducing latencies. Data was shared in the v1 cover letter. We have > >> not implemented the capacity based placement policies yet, but plan to > >> do that soon and have some real numbers to share. > >> > >> Ideas brought up during offlist discussion > >> ------------------------------------------------------- > >> > >> 1. rseq based timeslice extension mechanism[1] > >> > >> While the rseq based mechanism helps in giving the vcpu task one more > >> time slice, it will not help in the other use cases. We had a chat > >> with Steve and the rseq mechanism was mainly for improving lock > >> contention and would not work best with vcpu boosting considering all > >> the use cases above. RT or high priority tasks in the VM would often > >> need more than one time slice to complete its work and at the same, > >> should not be hurting the host workloads. The goal for the above use > >> cases is not requesting an extra slice, but to modify the priority in > >> such a way that host processes and guest processes get a fair way to > >> compete for cpu resources. This also means that vcpu task can request > >> a lower priority when it is running lower priority tasks in the VM. > > > > I was looking at the rseq on request from the KVM call, however it does not > > make sense to me yet how to expose the rseq area via the Guest VA to the host > > kernel. rseq is for userspace to kernel, not VM to kernel. > > > > Steven Rostedt said as much as well, thoughts? Add Mathieu as well. > > I'm not sure that rseq would help at all here, but I think we may want to > borrow concepts of data sitting in shared memory across privilege levels > and apply them to VMs. > > If some of the ideas end up being useful *outside* of the context of VMs, > then I'd be willing to consider adding fields to rseq. But as long as it is > VM-specific, I suspect you'd be better with dedicated per-vcpu pages which > you can safely share across host/guest kernels. Yes, this was the initial plan. I also feel rseq cannot be applied here. 
> > This idea seems to suffer from the same vDSO over-engineering below, rseq > > does not seem to fit. > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a > > driver, that does the boosting. > > I utterly dislike changing the system behavior through tracepoints. They were > designed to observe the system, not modify its behavior. If people start abusing > them, then subsystem maintainers will stop adding them. Please don't do that. > Add a notifier or think about integrating what you are planning to add into the > driver instead. Well, we do have "raw" tracepoints not accessible from userspace, so you're saying even those are off limits for adding callbacks? - Joel
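To make the "shared memory plus an occasional hypercall" discussion above concrete, here is one possible per-vCPU layout. It is purely illustrative; the actual pvsched protocol ABI is deliberately left undefined in this series:

    #include <linux/types.h>

    /* Illustrative layout only; field names and flags are made up. */
    struct pvsched_vcpu_shared {
            __u32 len;            /* size of the active fields */
            __u32 flags;          /* PVSCHED_HOST_BOOSTED set by the host */
            __s32 guest_prio;     /* priority requested by the guest scheduler */
            __s32 host_capacity;  /* capacity hint written by the host (future use) */
    };

    #define PVSCHED_HOST_BOOSTED  (1u << 0)

The boost path only writes guest_prio; the unboost hypercall is needed only when the host has actually acted on a boost, which is what the flag tracks.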
On Fri, 12 Jul 2024 11:32:30 -0400 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >>> I was looking at the rseq on request from the KVM call, however it does not > >>> make sense to me yet how to expose the rseq area via the Guest VA to the host > >>> kernel. rseq is for userspace to kernel, not VM to kernel. > > > > Any memory that is exposed to host userspace can be exposed to the guest. Things > > like this are implemented via "overlay" pages, where the guest asks host userspace > > to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a > > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the > > address of the page containing the rseq structure associated with the vCPU (in > > pretty much every modern VMM, each vCPU has a dedicated task/thread). > > > > A that point, the vCPU can read/write the rseq structure directly. So basically, the vCPU thread can just create a virtio device that exposes the rseq memory to the guest kernel? One other issue we need to worry about is that IIUC rseq memory is allocated by the guest/user, not the host kernel. This means it can be swapped out. The code that handles this needs to be able to handle user page faults. > > This helps me understand what you are trying to achieve. I disagree with > some aspects of the design you present above: mainly the lack of > isolation between the guest kernel and the host task doing the KVM_RUN. > We do not want to let the guest kernel store to rseq fields that would > result in getting the host task killed (e.g. a bogus rseq_cs pointer). > But this is something we can improve upon once we understand what we > are trying to achieve. > > > > > The reason us KVM folks are pushing y'all towards something like rseq is that > > (again, in any modern VMM) vCPUs are just tasks, i.e. priority boosting a vCPU > > is actually just priority boosting a task. So rather than invent something > > virtualization specific, invent a mechanism for priority boosting from userspace > > without a syscall, and then extend it to the virtualization use case. > > > [...] > > OK, so how about we expose "offsets" tuning the base values ? > > - The task doing KVM_RUN, just like any other task, has its "priority" > value as set by setpriority(2). > > - We introduce two new fields in the per-thread struct rseq, which is > mapped in the host task doing KVM_RUN and readable from the scheduler: > > - __s32 prio_offset; /* Priority offset to apply on the current task priority. */ > > - __u64 vcpu_sched; /* Pointer to a struct vcpu_sched in user-space */ > > vcpu_sched would be a userspace pointer to a new vcpu_sched structure, > which would be typically NULL except for tasks doing KVM_RUN. This would > sit in its own pages per vcpu, which takes care of isolation between guest > kernel and host process. Those would be RW by the guest kernel as > well and contain e.g.: Hmm, maybe not make this only vcpu specific, but perhaps this can be useful for user space tasks that want to dynamically change their priority without a system call. It could do the same thing. Yeah, yeah, I may be coming up with a solution in search of a problem ;-) -- Steve > > struct vcpu_sched { > __u32 len; /* Length of active fields. */ > > __s32 prio_offset; > __s32 cpu_capacity_offset; > [...] > }; > > So when the host kernel try to calculate the effective priority of a task > doing KVM_RUN, it would basically start from its current priority, and offset > by (rseq->prio_offset + rseq->vcpu_sched->prio_offset). 
> > The cpu_capacity_offset would be populated by the host kernel and read by the > guest kernel scheduler for scheduling/migration decisions. > > I'm certainly missing details about how priority offsets should be bounded for > given tasks. This could be an extension to setrlimit(2). > > Thoughts ? > > Thanks, > > Mathieu >
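The effective-priority arithmetic Mathieu describes is simple. A sketch, assuming the proposed prio_offset fields were added and their values have already been read safely out of the rseq/vcpu_sched pages (the bounds are invented, pending something like the setrlimit(2) extension he mentions):

    #include <linux/sched.h>
    #include <linux/sched/prio.h>
    #include <linux/minmax.h>

    /* rseq_off and vcpu_off would come from the proposed rseq/vcpu_sched fields. */
    static int pvsched_effective_nice(struct task_struct *p, s32 rseq_off, s32 vcpu_off)
    {
            long nice = task_nice(p) + rseq_off + vcpu_off;

            /* Invented bounds; a per-task limit would be needed in practice. */
            return clamp_val(nice, MIN_NICE, MAX_NICE);
    }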
On Fri, Jul 12, 2024, Steven Rostedt wrote: > On Fri, 12 Jul 2024 11:32:30 -0400 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > >>> I was looking at the rseq on request from the KVM call, however it does not > > >>> make sense to me yet how to expose the rseq area via the Guest VA to the host > > >>> kernel. rseq is for userspace to kernel, not VM to kernel. > > > > > > Any memory that is exposed to host userspace can be exposed to the guest. Things > > > like this are implemented via "overlay" pages, where the guest asks host userspace > > > to map the magic page (rseq in this case) at GPA 'x'. Userspace then creates a > > > memslot that overlays guest RAM to map GPA 'x' to host VA 'y', where 'y' is the > > > address of the page containing the rseq structure associated with the vCPU (in > > > pretty much every modern VMM, each vCPU has a dedicated task/thread). > > > > > > A that point, the vCPU can read/write the rseq structure directly. > > So basically, the vCPU thread can just create a virtio device that > exposes the rseq memory to the guest kernel? > > One other issue we need to worry about is that IIUC rseq memory is > allocated by the guest/user, not the host kernel. This means it can be > swapped out. The code that handles this needs to be able to handle user > page faults. This is a non-issue, it will Just Work, same as any other memory that is exposed to the guest and can be reclaimed/swapped/migrated.. If the host swaps out the rseq page, mmu_notifiers will call into KVM and KVM will unmap the page from the guest. If/when the page is accessed by the guest, KVM will fault the page back into the host's primary MMU, and then map the new pfn into the guest.
On Fri, Jul 12, 2024, Steven Rostedt wrote: > On Fri, 12 Jul 2024 10:09:03 -0400 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a > > > driver, that does the boosting. > > > > I utterly dislike changing the system behavior through tracepoints. They were > > designed to observe the system, not modify its behavior. If people start abusing > > them, then subsystem maintainers will stop adding them. Please don't do that. > > Add a notifier or think about integrating what you are planning to add into the > > driver instead. > > I tend to agree that a notifier would be much better than using > tracepoints, but then I also think eBPF has already let that cat out of > the bag. :-p > > All we need is a notifier that gets called at every VMEXIT. Why? The only argument I've seen for needing to hook VM-Exit is so that the host can speculatively boost the priority of the vCPU when delivering an IRQ, but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on modern hardware that supports posted interrupts and IPI virtualization, i.e. for which there will be no VM-Exit.
On Fri, Jul 12, 2024 at 09:44:16AM -0700, Sean Christopherson wrote: > On Fri, Jul 12, 2024, Steven Rostedt wrote: > > On Fri, 12 Jul 2024 10:09:03 -0400 > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a > > > > driver, that does the boosting. > > > > > > I utterly dislike changing the system behavior through tracepoints. They were > > > designed to observe the system, not modify its behavior. If people start abusing > > > them, then subsystem maintainers will stop adding them. Please don't do that. > > > Add a notifier or think about integrating what you are planning to add into the > > > driver instead. > > > > I tend to agree that a notifier would be much better than using > > tracepoints, but then I also think eBPF has already let that cat out of > > the bag. :-p > > > > All we need is a notifier that gets called at every VMEXIT. > > Why? The only argument I've seen for needing to hook VM-Exit is so that the > host can speculatively boost the priority of the vCPU when deliverying an IRQ, > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on > modern hardware that supports posted interrupts and IPI virtualization, i.e. for > which there will be no VM-Exit. I am a bit confused by your statement Sean, because if a higher prio HOST thread wakes up on the vCPU thread's physical CPU, then a VM-Exit should happen. That has nothing to do with IRQ delivery. What am I missing? thanks, - Joel
On Fri, 12 Jul 2024 09:39:52 -0700 Sean Christopherson <seanjc@google.com> wrote: > > > > One other issue we need to worry about is that IIUC rseq memory is > > allocated by the guest/user, not the host kernel. This means it can be > > swapped out. The code that handles this needs to be able to handle user > > page faults. > > This is a non-issue, it will Just Work, same as any other memory that is exposed > to the guest and can be reclaimed/swapped/migrated.. > > If the host swaps out the rseq page, mmu_notifiers will call into KVM and KVM will > unmap the page from the guest. If/when the page is accessed by the guest, KVM > will fault the page back into the host's primary MMU, and then map the new pfn > into the guest. My point is that, in the host kernel, any access to this memory needs to be user-page-fault safe. It cannot be touched from atomic context. -- Steve
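Steve's constraint is essentially that any host-side read of such a page from a non-sleepable path has to use a non-faulting accessor and tolerate failure. A sketch using copy_from_user_nofault() together with the illustrative structure sketched earlier in this thread:

    #include <linux/uaccess.h>

    /* Returns false if the page is not resident; never faults it in. */
    static bool pvsched_read_guest_prio(struct pvsched_vcpu_shared __user *sh,
                                        s32 *prio)
    {
            return !copy_from_user_nofault(prio, &sh->guest_prio, sizeof(*prio));
    }

Callers in atomic context would have to fall back to the last known priority, or defer the update to a sleepable context, when this fails.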
On Fri, Jul 12, 2024, Joel Fernandes wrote: > On Fri, Jul 12, 2024 at 09:44:16AM -0700, Sean Christopherson wrote: > > On Fri, Jul 12, 2024, Steven Rostedt wrote: > > > On Fri, 12 Jul 2024 10:09:03 -0400 > > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > > > > > > Steven Rostedt told me, what we instead need is a tracepoint callback in a > > > > > driver, that does the boosting. > > > > > > > > I utterly dislike changing the system behavior through tracepoints. They were > > > > designed to observe the system, not modify its behavior. If people start abusing > > > > them, then subsystem maintainers will stop adding them. Please don't do that. > > > > Add a notifier or think about integrating what you are planning to add into the > > > > driver instead. > > > > > > I tend to agree that a notifier would be much better than using > > > tracepoints, but then I also think eBPF has already let that cat out of > > > the bag. :-p > > > > > > All we need is a notifier that gets called at every VMEXIT. > > > > Why? The only argument I've seen for needing to hook VM-Exit is so that the > > host can speculatively boost the priority of the vCPU when deliverying an IRQ, > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for > > which there will be no VM-Exit. > > I am a bit confused by your statement Sean, because if a higher prio HOST > thread wakes up on the vCPU thread's phyiscal CPU, then a VM-Exit should > happen. That has nothing to do with IRQ delivery. What am I missing? Why does that require hooking VM-Exit?
On Fri, 12 Jul 2024 09:44:16 -0700 Sean Christopherson <seanjc@google.com> wrote: > > All we need is a notifier that gets called at every VMEXIT. > > Why? The only argument I've seen for needing to hook VM-Exit is so that the > host can speculatively boost the priority of the vCPU when deliverying an IRQ, > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on > modern hardware that supports posted interrupts and IPI virtualization, i.e. for > which there will be no VM-Exit. No. The speculative boost was for something else, but slightly related. I guess the idea there was to have the incoming interrupt boost the vCPU because the interrupt could be waking an RT task. It may still be something needed, but that's not what I'm talking about here. The idea here is that when an RT task is scheduled in on the guest, we want to lazily boost it. As long as the vCPU is running on the CPU, we do not need to do anything. If the RT task is scheduled for a very short time, it should not need to make any hypercall. It would set the shared memory to the new priority when the RT task is scheduled, and then put back the lower priority when it is scheduled out and a SCHED_OTHER task is scheduled in. Now if the vCPU gets preempted, it is at this moment that we need the host kernel to look at the current priority of the task thread running on the vCPU. If it is an RT task, we need to boost the vCPU to that priority, so that a lower priority host thread does not interrupt it. The host should also set a bit in the shared memory to tell the guest that it was boosted. Then when the vCPU schedules a lower priority task than what is in shared memory, and the bit is set that tells the guest the host boosted the vCPU, it needs to make a hypercall to tell the host that it can lower its priority again. The incoming irq is to handle the race between the event that wakes the RT task and the RT task getting a chance to run. If the preemption happens there, the vCPU may never have a chance to notify the host that it wants to run an RT task. -- Steve
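The guest side of the lazy scheme Steve outlines could look roughly like this; the structure, flag and hypercall number are the illustrative ones from earlier in the thread, not a defined ABI:

    #include <linux/kvm_para.h>

    #define PVSCHED_HC_UNBOOST  0x1000  /* made-up hypercall number */

    /* Guest schedules in a higher-priority task: no hypercall, just a store. */
    static void pvsched_guest_sched_in(struct pvsched_vcpu_shared *sh, s32 prio)
    {
            WRITE_ONCE(sh->guest_prio, prio);
    }

    /* Guest goes back to a low-priority task: hypercall only if the host boosted us. */
    static void pvsched_guest_sched_out(struct pvsched_vcpu_shared *sh, s32 low_prio)
    {
            WRITE_ONCE(sh->guest_prio, low_prio);
            if (READ_ONCE(sh->flags) & PVSCHED_HOST_BOOSTED)
                    kvm_hypercall0(PVSCHED_HC_UNBOOST);
    }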
On Fri, 12 Jul 2024 10:08:43 -0700 Sean Christopherson <seanjc@google.com> wrote: > > I am a bit confused by your statement Sean, because if a higher prio HOST > > thread wakes up on the vCPU thread's phyiscal CPU, then a VM-Exit should > > happen. That has nothing to do with IRQ delivery. What am I missing? > > Why does that require hooking VM-Exit? To do the lazy boosting. That's the time that the host can see the priority of the currently running task. -- Steve
On 2024-07-12 12:24, Joel Fernandes wrote: > On Fri, Jul 12, 2024 at 10:09 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: [...] >>> >>> Steven Rostedt told me, what we instead need is a tracepoint callback in a >>> driver, that does the boosting. >> >> I utterly dislike changing the system behavior through tracepoints. They were >> designed to observe the system, not modify its behavior. If people start abusing >> them, then subsystem maintainers will stop adding them. Please don't do that. >> Add a notifier or think about integrating what you are planning to add into the >> driver instead. > > Well, we do have "raw" tracepoints not accessible from userspace, so > you're saying even those are off limits for adding callbacks? Yes. Even the "raw" tracepoints were designed as an "observation only" API. Using them in lieu of notifiers is really repurposing them for something they were not meant to do. Just in terms of maintainability at the caller site, we should be allowed to consider _all_ tracepoints as mostly exempt from side-effects outside of the data structures within the attached tracers. This is not true anymore if they are repurposed as notifiers. Thanks, Mathieu
On Fri, Jul 12, 2024, Steven Rostedt wrote: > On Fri, 12 Jul 2024 09:44:16 -0700 > Sean Christopherson <seanjc@google.com> wrote: > > > > All we need is a notifier that gets called at every VMEXIT. > > > > Why? The only argument I've seen for needing to hook VM-Exit is so that the > > host can speculatively boost the priority of the vCPU when deliverying an IRQ, > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for > > which there will be no VM-Exit. > > No. The speculatively boost was for something else, but slightly > related. I guess the ideal there was to have the interrupt coming in > boost the vCPU because the interrupt could be waking an RT task. It may > still be something needed, but that's not what I'm talking about here. > > The idea here is when an RT task is scheduled in on the guest, we want > to lazily boost it. As long as the vCPU is running on the CPU, we do > not need to do anything. If the RT task is scheduled for a very short > time, it should not need to call any hypercall. It would set the shared > memory to the new priority when the RT task is scheduled, and then put > back the lower priority when it is scheduled out and a SCHED_OTHER task > is scheduled in. > > Now if the vCPU gets preempted, it is this moment that we need the host > kernel to look at the current priority of the task thread running on > the vCPU. If it is an RT task, we need to boost the vCPU to that > priority, so that a lower priority host thread does not interrupt it. I got all that, but I still don't see any need to hook VM-Exit. If the vCPU gets preempted, the host scheduler is already getting "notified", otherwise the vCPU would still be scheduled in, i.e. wouldn't have been preempted. > The host should also set a bit in the shared memory to tell the guest > that it was boosted. Then when the vCPU schedules a lower priority task > than what is in shared memory, and the bit is set that tells the guest > the host boosted the vCPU, it needs to make a hypercall to tell the > host that it can lower its priority again. Which again doesn't _need_ a dedicated/manual VM-Exit. E.g. why force the host to reasses the priority instead of simply waiting until the next reschedule? If the host is running tickless, then presumably there is a scheduling entity running on a different pCPU, i.e. that can react to vCPU priority changes without needing a VM-Exit.
On Tue, 16 Jul 2024 16:44:05 -0700 Sean Christopherson <seanjc@google.com> wrote: > > > > Now if the vCPU gets preempted, it is this moment that we need the host > > kernel to look at the current priority of the task thread running on > > the vCPU. If it is an RT task, we need to boost the vCPU to that > > priority, so that a lower priority host thread does not interrupt it. > > I got all that, but I still don't see any need to hook VM-Exit. If the vCPU gets > preempted, the host scheduler is already getting "notified", otherwise the vCPU > would still be scheduled in, i.e. wouldn't have been preempted. The guest wants to lazily up its priority when needed. So, it changes its priority on this shared memory, but the host doesn't know about the raised priority, and decides to preempt it (where it would not if it knew the priority was raised). Then it exits into the host via VMEXIT. When else is the host going to know of this priority changed? > > > The host should also set a bit in the shared memory to tell the guest > > that it was boosted. Then when the vCPU schedules a lower priority task > > than what is in shared memory, and the bit is set that tells the guest > > the host boosted the vCPU, it needs to make a hypercall to tell the > > host that it can lower its priority again. > > Which again doesn't _need_ a dedicated/manual VM-Exit. E.g. why force the host > to reasses the priority instead of simply waiting until the next reschedule? If > the host is running tickless, then presumably there is a scheduling entity running > on a different pCPU, i.e. that can react to vCPU priority changes without needing > a VM-Exit. This is done in a shared memory location. The guest can raise and lower its priority via writing into the shared memory. It may raise and lower it back without the host ever knowing. No hypercall needed. But if it raises its priority, and the host decides to schedule it because the host is unaware of its raised priority, it will preempt it. Then when it exits into the host (via VMEXIT) this is the first time the host will know that its priority was raised, and then we can call something like rt_mutex_setprio() to lazily change its priority. It would then also set a bit to inform the guest that the host knows of the change, and when the guest lowers its priority, it will now need to make a hypercall to tell the kernel its priority is low again, and it's OK to preempt it normally. This is similar to how some architectures do lazy irq disabling. Where they only set some memory that says interrupts are disabled. But interrupts only get disabled if an interrupt goes off and the code sees it's "soft disabled", and then will disable interrupts. When the interrupts are enabled again, it then calls the interrupt handler. What are you suggesting to do for this fast way of increasing and decreasing the priority of tasks? -- Steve
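The host half of that same scheme, applied at the point where the host is about to preempt the vCPU task (for example from a VM-exit hook), might look like the sketch below. sched_setattr_nocheck() is a real in-kernel API, but the priority encoding and the boosted flag are illustrative assumptions:

    #include <linux/sched.h>
    #include <uapi/linux/sched/types.h>

    static void pvsched_host_sync_prio(struct task_struct *vcpu_task,
                                       struct pvsched_vcpu_shared *sh)
    {
            struct sched_attr attr = { .size = sizeof(attr) };
            s32 prio = READ_ONCE(sh->guest_prio);

            if (prio < 0) {                 /* say, negative encodes an RT request */
                    attr.sched_policy   = SCHED_FIFO;
                    attr.sched_priority = -prio;
                    if (!sched_setattr_nocheck(vcpu_task, &attr))
                            WRITE_ONCE(sh->flags,
                                       READ_ONCE(sh->flags) | PVSCHED_HOST_BOOSTED);
            } else {
                    attr.sched_policy = SCHED_NORMAL;
                    attr.sched_nice   = prio;
                    sched_setattr_nocheck(vcpu_task, &attr);
            }
    }

Because this goes through sched_setattr_nocheck(), policy changes get the normal scheduler treatment, including migration, which is the concern raised further down in the thread.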
On Tue, Jul 16, 2024 at 7:44 PM Sean Christopherson <seanjc@google.com> wrote: > > On Fri, Jul 12, 2024, Steven Rostedt wrote: > > On Fri, 12 Jul 2024 09:44:16 -0700 > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > All we need is a notifier that gets called at every VMEXIT. > > > > > > Why? The only argument I've seen for needing to hook VM-Exit is so that the > > > host can speculatively boost the priority of the vCPU when deliverying an IRQ, > > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted > > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on > > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for > > > which there will be no VM-Exit. > > > > No. The speculatively boost was for something else, but slightly > > related. I guess the ideal there was to have the interrupt coming in > > boost the vCPU because the interrupt could be waking an RT task. It may > > still be something needed, but that's not what I'm talking about here. > > > > The idea here is when an RT task is scheduled in on the guest, we want > > to lazily boost it. As long as the vCPU is running on the CPU, we do > > not need to do anything. If the RT task is scheduled for a very short > > time, it should not need to call any hypercall. It would set the shared > > memory to the new priority when the RT task is scheduled, and then put > > back the lower priority when it is scheduled out and a SCHED_OTHER task > > is scheduled in. > > > > Now if the vCPU gets preempted, it is this moment that we need the host > > kernel to look at the current priority of the task thread running on > > the vCPU. If it is an RT task, we need to boost the vCPU to that > > priority, so that a lower priority host thread does not interrupt it. > > I got all that, but I still don't see any need to hook VM-Exit. If the vCPU gets > preempted, the host scheduler is already getting "notified", otherwise the vCPU > would still be scheduled in, i.e. wouldn't have been preempted. What you're saying is the scheduler should change the priority of the vCPU thread dynamically. That's really not the job of the scheduler. The user of the scheduler is what changes the priority of threads, not the scheduler itself. Joel
On Wed, Jul 17, 2024, Joel Fernandes wrote: > On Tue, Jul 16, 2024 at 7:44 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Fri, Jul 12, 2024, Steven Rostedt wrote: > > > On Fri, 12 Jul 2024 09:44:16 -0700 > > > Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > All we need is a notifier that gets called at every VMEXIT. > > > > > > > > Why? The only argument I've seen for needing to hook VM-Exit is so that the > > > > host can speculatively boost the priority of the vCPU when deliverying an IRQ, > > > > but (a) I'm unconvinced that is necessary, i.e. that the vCPU needs to be boosted > > > > _before_ the guest IRQ handler is invoked and (b) it has almost no benefit on > > > > modern hardware that supports posted interrupts and IPI virtualization, i.e. for > > > > which there will be no VM-Exit. > > > > > > No. The speculatively boost was for something else, but slightly > > > related. I guess the ideal there was to have the interrupt coming in > > > boost the vCPU because the interrupt could be waking an RT task. It may > > > still be something needed, but that's not what I'm talking about here. > > > > > > The idea here is when an RT task is scheduled in on the guest, we want > > > to lazily boost it. As long as the vCPU is running on the CPU, we do > > > not need to do anything. If the RT task is scheduled for a very short > > > time, it should not need to call any hypercall. It would set the shared > > > memory to the new priority when the RT task is scheduled, and then put > > > back the lower priority when it is scheduled out and a SCHED_OTHER task > > > is scheduled in. > > > > > > Now if the vCPU gets preempted, it is this moment that we need the host > > > kernel to look at the current priority of the task thread running on > > > the vCPU. If it is an RT task, we need to boost the vCPU to that > > > priority, so that a lower priority host thread does not interrupt it. > > > > I got all that, but I still don't see any need to hook VM-Exit. If the vCPU gets > > preempted, the host scheduler is already getting "notified", otherwise the vCPU > > would still be scheduled in, i.e. wouldn't have been preempted. > > What you're saying is the scheduler should change the priority of the > vCPU thread dynamically. That's really not the job of the scheduler. > The user of the scheduler is what changes the priority of threads, not > the scheduler itself. No. If we go the proposed route[*] of adding a data structure that lets userspace and/or the guest express/adjust the task's priority, then the scheduler simply checks that data structure when querying the priority of a task. [*] https://lore.kernel.org/all/ZpFWfInsXQdPJC0V@google.com
On Wed, 17 Jul 2024 07:14:59 -0700 Sean Christopherson <seanjc@google.com> wrote: > > What you're saying is the scheduler should change the priority of the > > vCPU thread dynamically. That's really not the job of the scheduler. > > The user of the scheduler is what changes the priority of threads, not > > the scheduler itself. > > No. If we go the proposed route[*] of adding a data structure that lets userspace > and/or the guest express/adjust the task's priority, then the scheduler simply > checks that data structure when querying the priority of a task. The problem with that is the only use case for such a feature is for vCPUS. There's no use case for a single thread to up and down its priority. I work a lot in RT applications (well, not as much anymore, but my career was heavy into it). And I can't see any use case where a single thread would bounce its priority around. In fact, if I did see that, I would complain that it was a poorly designed system. Now for a guest kernel, that's very different. It has to handle things like priority inheritance and such, where bouncing a threads (or its own vCPU thread) priority most definitely makes sense. So you are requesting that we add a bad user space interface to allow lazy priority management from a thread so that we can use it in the proper use case of a vCPU? -- Steve
On Wed, 17 Jul 2024 10:36:47 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > The problem with that is the only use case for such a feature is for > vCPUS. There's no use case for a single thread to up and down its > priority. I work a lot in RT applications (well, not as much anymore, > but my career was heavy into it). And I can't see any use case where a > single thread would bounce its priority around. In fact, if I did see > that, I would complain that it was a poorly designed system. > > Now for a guest kernel, that's very different. It has to handle things > like priority inheritance and such, where bouncing a threads (or its > own vCPU thread) priority most definitely makes sense. > > So you are requesting that we add a bad user space interface to allow > lazy priority management from a thread so that we can use it in the > proper use case of a vCPU? Now I stated the above thinking you wanted to add a generic interface for all user space. But perhaps there is a way to get this to be done by the scheduler itself. But its use case is still only for VMs. We could possibly add a new sched class that has a dynamic priority. That is, it can switch between other sched classes. A vCPU thread could be assigned to this class from inside the kernel (via a virtio device) where this is not exposed to user space at all. Then the virtio device would control the mapping of a page between the vCPU thread and the host kernel. When this task gets scheduled, it can call into the code that handles the dynamic priority. This will require buy-in from the scheduler folks. This could also handle the case of a vCPU being woken up by an interrupt, as the hooks could be there on the wakeup side as well. Thoughts? -- Steve
On Wed, 17 Jul 2024 10:52:33 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> We could possibly add a new sched class that has a dynamic priority.
It wouldn't need to be a new sched class. This could work with just a
task_struct flag.
It would only need to be checked in pick_next_task() and
try_to_wake_up(). It would require that the shared memory be
allocated by the host kernel and always present (unlike rseq). But with this
coming from a virtio device driver, that shouldn't be a problem.
If this flag is set on current, then the first thing that
pick_next_task() should do is to see if it needs to change current's
priority and policy (via a callback to the driver). And then it can
decide which task to pick; if current was boosted, it could very well
be the next task again.
In try_to_wake_up(), if the task waking up has this flag set, it could
boost it via an option set by the virtio device. This would allow it to
preempt the current process if necessary and get on the CPU. Then the
guest would be required to lower its priority if the boost was not
needed.
Hmm, this could work.
-- Steve
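A rough sketch of that task-flag variant; PF_VCPU_PVSCHED, the flag value, and the driver callbacks are all made-up names, and the real hooks would sit inside pick_next_task() and try_to_wake_up():

    #include <linux/sched.h>

    /* Hypothetical flag; assumes a free PF_ bit. */
    #define PF_VCPU_PVSCHED  0x01000000

    void pvsched_driver_update_prio(struct task_struct *p);   /* hypothetical */
    void pvsched_driver_wakeup_boost(struct task_struct *p);  /* hypothetical */

    /* Called at the top of pick_next_task() for the current task. */
    static inline void pvsched_check_current(struct task_struct *curr)
    {
            if (curr->flags & PF_VCPU_PVSCHED)
                    pvsched_driver_update_prio(curr);   /* may change policy/prio */
    }

    /* Called from try_to_wake_up() for a flagged vCPU task. */
    static inline void pvsched_check_wakeup(struct task_struct *p)
    {
            if (p->flags & PF_VCPU_PVSCHED)
                    pvsched_driver_wakeup_boost(p);     /* optional boost on wakeup */
    }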
On Thu, Jul 18, 2024 at 12:20 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 17 Jul 2024 10:52:33 -0400 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > We could possibly add a new sched class that has a dynamic priority. > > It wouldn't need to be a new sched class. This could work with just a > task_struct flag. > > It would only need to be checked in pick_next_task() and > try_to_wake_up(). It would require that the shared memory has to be > allocated by the host kernel and always present (unlike rseq). But this > coming from a virtio device driver, that shouldn't be a problem. > > If this flag is set on current, then the first thing that > pick_next_task() should do is to see if it needs to change current's > priority and policy (via a callback to the driver). And then it can > decide what task to pick, as if current was boosted, it could very well > be the next task again. > > In try_to_wake_up(), if the task waking up has this flag set, it could > boost it via an option set by the virtio device. This would allow it to > preempt the current process if necessary and get on the CPU. Then the > guest would be require to lower its priority if it the boost was not > needed. > > Hmm, this could work. For what it's worth, I proposed something somewhat conceptually similar before: https://lore.kernel.org/kvm/CABCjUKBXCFO4-cXAUdbYEKMz4VyvZ5hD-1yP9H7S7eL8XsqO-g@mail.gmail.com/T/ Guests VCPUs would report their preempt_count to the host and the host would use that to try not to preempt a VCPU that was in a critical section (with some simple safeguards in case the guest was not well behaved). (It worked by adding a "may_preempt" notifier that would get called in schedule(), whose return value would determine whether we'd try to schedule away from current or not.) It was VM specific, but the same idea could be made to work for generic userspace tasks. -- Suleiman
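For reference, the core of that earlier idea can be sketched as a hook consulted from schedule(); the hook itself, the shared preempt-count field and the deferral cap are all hypothetical here:

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    static bool pvsched_may_preempt(struct pvsched_vcpu_shared *sh,
                                    unsigned long defer_until)
    {
            /* defer_until caps how long we trust the guest's critical section. */
            if (READ_ONCE(sh->guest_preempt_count) &&
                time_before(jiffies, defer_until))
                    return false;
            return true;
    }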
On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 17 Jul 2024 10:52:33 -0400 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > We could possibly add a new sched class that has a dynamic priority. > > > > It wouldn't need to be a new sched class. This could work with just a > > task_struct flag. > > > > It would only need to be checked in pick_next_task() and > > try_to_wake_up(). It would require that the shared memory has to be > > allocated by the host kernel and always present (unlike rseq). But this > > coming from a virtio device driver, that shouldn't be a problem. The problem is it's not only about preemption: if we boost the vCPU to the RT class and another RT task is already running on the same CPU, then the vCPU thread should get migrated to a different CPU. I don't think we can do that unless we go through a proper sched_setscheduler() / sched_setattr() and let the scheduler handle things. Vineeth's patches were doing that at VMEXIT. - Joel
On Wed, 17 Jul 2024 16:57:43 -0400 Joel Fernandes <joel@joelfernandes.org> wrote: > On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Wed, 17 Jul 2024 10:52:33 -0400 > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > We could possibly add a new sched class that has a dynamic priority. > > > > It wouldn't need to be a new sched class. This could work with just a > > task_struct flag. > > > > It would only need to be checked in pick_next_task() and > > try_to_wake_up(). It would require that the shared memory has to be > > allocated by the host kernel and always present (unlike rseq). But this > > coming from a virtio device driver, that shouldn't be a problem. > > Problem is its not only about preemption, if we set the vCPU boosted > to RT class, and another RT task is already running on the same CPU, That can only happen on wakeup (interrupt). The point of lazy priority changing is that it is only done while the vCPU is running. -- Steve > then the vCPU thread should get migrated to different CPU. We can't do > that I think if we just did it without doing a proper > sched_setscheduler() / sched_setattr() and let the scheduler handle > things. Vineeth's patches was doing that in VMEXIT..
On Wed, Jul 17, 2024 at 5:00 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 17 Jul 2024 16:57:43 -0400 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > On Wed, Jul 17, 2024 at 11:20 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > On Wed, 17 Jul 2024 10:52:33 -0400 > > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > We could possibly add a new sched class that has a dynamic priority. > > > > > > It wouldn't need to be a new sched class. This could work with just a > > > task_struct flag. > > > > > > It would only need to be checked in pick_next_task() and > > > try_to_wake_up(). It would require that the shared memory has to be > > > allocated by the host kernel and always present (unlike rseq). But this > > > coming from a virtio device driver, that shouldn't be a problem. > > > > Problem is its not only about preemption, if we set the vCPU boosted > > to RT class, and another RT task is already running on the same CPU, > > That can only happen on wakeup (interrupt). As the point of lazy > priority changing, it is only done when the vCPU is running. True, but I think it will miss stuff related to load balancing, say if the "boost" is a higher CFS priority. Then someone has to pull the running vCPU thread to another CPU etc... IMO it is better to set the priority/class externally and let the scheduler deal with it. Let me think some more about your idea though.. thanks, - Joel