Message ID: 20210916162451.709260-1-guro@fb.com
Series: Scheduler BPF
Hello! I'm sorry, somehow the patchset didn't reach the mailing lists and at least some recipients yesterday (I'm digging into why). Resending with fewer people in the cc list, which was probably the reason. Thanks! On Thu, Sep 16, 2021 at 09:24:45AM -0700, Roman Gushchin wrote: > There is a long history of distro people, system administrators, and > application owners tuning the CFS settings in /proc/sys, which are now > in debugfs. Looking at what these settings actually did, it ended up > boiling down to changing the likelihood of task preemption, or > disabling it by setting the wakeup_granularity_ns to more than half of > the latency_ns. The other settings didn't really do much for > performance. >
Hi Roman On 09/16/21 09:24, Roman Gushchin wrote: > There is a long history of distro people, system administrators, and > application owners tuning the CFS settings in /proc/sys, which are now > in debugfs. Looking at what these settings actually did, it ended up > boiling down to changing the likelihood of task preemption, or > disabling it by setting the wakeup_granularity_ns to more than half of > the latency_ns. The other settings didn't really do much for > performance. > > In other words, some our workloads benefit by having long running tasks > preempted by tasks handling short running requests, and some workloads > that run only short term requests which benefit from never being preempted. We had a discussion about introducing a latency-nice hint, but that discussion didn't end up producing any new API. Your use case seems similar to Android's: we want some tasks to run ASAP. There's an out-of-tree patch that puts these tasks on an idle CPU (keep energy-aware scheduling in mind in this context), which seems okay for its purpose. Having a more generic solution in mainline would be nice. https://lwn.net/Articles/820659/ > > This leads to a few observations and ideas: > - Different workloads want different policies. Being able to configure > the policy per workload could be useful. > - A workload that benefits from not being preempted itself could still > benefit from preempting (low priority) background system tasks. You can run these tasks as SCHED_IDLE. There's a potential danger of starving these tasks, but assuming they're background and there's idle time in the system, that should be fine. https://lwn.net/Articles/805317/ That of course assumes you can classify these background tasks. If you can do the classification, you can also use cpu.shares to reduce how much cpu time they get. Or the CFS bandwidth controller https://lwn.net/Articles/844976/ I like Android's model of classifying tasks. I think we need this classification done by other, non-Android systems too. > - It would be useful to quickly (and safely) experiment with different > policies in production, without having to shut down applications or reboot > systems, to determine what the policies for different workloads should be. Userspace should have the knobs that allow it to tune that without a reboot. If you're doing kernel development, then it's part of the job spec I'd say :-) I think one can still go with the workflow you suggest for development without the hooks. You'd need to un-inline the function you're interested in; then you can use kprobes to hook into it and force an early return. That should produce the same effect, no? > - Only a few workloads are large and sensitive enough to merit their own > policy tweaks. CFS by itself should be good enough for everything else, > and we probably do not want policy tweaks to be a replacement for anything > CFS does. > > This leads to BPF hooks, which have been successfully used in various > kernel subsystems to provide a way for external code to (safely) > change a few kernel decisions. BPF tooling makes this pretty easy to do, > and the people deploying BPF scripts are already quite used to updating them > for new kernel versions. I am (very) wary of these hooks. The scheduler (in mobile at least) is an area that gets heavily modified by vendors and OEMs. We try very hard to understand the problems they face and get the right set of solutions in mainline.
Which would ultimately help towards the goal of having a single Generic kernel Image [1] that gives you what you'd expect out of the platform without any need for additional cherries on top. So my worry is that this will open the gate for these hooks to get more than just micro-optimization done in a platform specific way. And that it will discourage having the right discussion to fix real problems in the scheduler because the easy path is to do whatever you want in userspace. I am not sure we can control how these hooks are used. The question is: why can't we fix any issues in the scheduler/make it better and must have these hooks instead? [1] https://arstechnica.com/gadgets/2021/09/android-to-take-an-upstream-first-development-model-for-the-linux-kernel/ Thanks -- Qais Yousef > > This patchset aims to start a discussion about potential applications of BPF > to the scheduler. It also aims to land some very basic BPF infrastructure > necessary to add new BPF hooks to the scheduler, a minimal set of useful > helpers, corresponding libbpf changes, etc. > > Our very first experiments with using BPF in CFS look very promising. We're > at a very early stage, however already have seen a nice latency and ~1% RPS > wins for our (Facebook's) main web workload. > > As I know, Google is working on a more radical approach [2]: they aim to move > the scheduling code into userspace. It seems that their core motivation is > somewhat similar: to make the scheduler changes easier to develop, validate > and deploy. Even though their approach is different, they also use BPF for > speeding up some hot paths. I think the suggested infrastructure can serve > their purpose too. > > An example of an userspace part, which loads some simple hooks is available > here [3]. It's very simple, provided only to simplify playing with the provided > kernel patches. > > > [1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently") > [2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/ > [3] https://github.com/rgushchin/atc > > > Roman Gushchin (6): > bpf: sched: basic infrastructure for scheduler bpf > bpf: sched: add convenient helpers to identify sched entities > bpf: sched: introduce bpf_sched_enable() > sched: cfs: add bpf hooks to control wakeup and tick preemption > libbpf: add support for scheduler bpf programs > bpftool: recognize scheduler programs > > include/linux/bpf_sched.h | 53 ++++++++++++ > include/linux/bpf_types.h | 3 + > include/linux/sched_hook_defs.h | 4 + > include/uapi/linux/bpf.h | 25 ++++++ > kernel/bpf/btf.c | 1 + > kernel/bpf/syscall.c | 21 ++++- > kernel/bpf/trampoline.c | 1 + > kernel/bpf/verifier.c | 9 ++- > kernel/sched/Makefile | 1 + > kernel/sched/bpf_sched.c | 138 ++++++++++++++++++++++++++++++++ > kernel/sched/fair.c | 27 +++++++ > scripts/bpf_doc.py | 2 + > tools/bpf/bpftool/common.c | 1 + > tools/bpf/bpftool/prog.c | 1 + > tools/include/uapi/linux/bpf.h | 25 ++++++ > tools/lib/bpf/libbpf.c | 27 ++++++- > tools/lib/bpf/libbpf.h | 4 + > tools/lib/bpf/libbpf.map | 3 + > 18 files changed, 341 insertions(+), 5 deletions(-) > create mode 100644 include/linux/bpf_sched.h > create mode 100644 include/linux/sched_hook_defs.h > create mode 100644 kernel/sched/bpf_sched.c > > -- > 2.31.1 >
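As a concrete illustration of the SCHED_IDLE classification suggested above, here is a minimal userspace sketch. sched_setscheduler() and SCHED_IDLE are standard Linux APIs; which PIDs count as background is an assumption left to the workload's own classification.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Minimal sketch: demote one already-classified background task to
 * SCHED_IDLE so it only runs when there is otherwise idle time. */
static int demote_to_idle(pid_t pid)
{
        struct sched_param param = { .sched_priority = 0 }; /* must be 0 for SCHED_IDLE */

        if (sched_setscheduler(pid, SCHED_IDLE, &param)) {
                perror("sched_setscheduler");
                return -1;
        }
        return 0;
}

The cgroup-level knobs mentioned above (cpu.shares, the CFS bandwidth controller) cover softer variants of the same idea when per-task classification is impractical.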
On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote: > Hi Roman > > On 09/16/21 09:24, Roman Gushchin wrote: > > There is a long history of distro people, system administrators, and > > application owners tuning the CFS settings in /proc/sys, which are now > > in debugfs. Looking at what these settings actually did, it ended up > > boiling down to changing the likelihood of task preemption, or > > disabling it by setting the wakeup_granularity_ns to more than half of > > the latency_ns. The other settings didn't really do much for > > performance. > > > > In other words, some our workloads benefit by having long running tasks > > preempted by tasks handling short running requests, and some workloads > > that run only short term requests which benefit from never being preempted. > > We had discussion about introducing latency-nice hint; but that discussion > didn't end up producing any new API. Your use case seem similar to Android's; > we want some tasks to run ASAP. There's an out of tree patch that puts these > tasks on an idle CPU (keep in mind energy aware scheduling in the context here) > which seem okay for its purpose. Having a more generic solution in mainline > would be nice. > > https://lwn.net/Articles/820659/ Hello Qais! Thank you for the link, I like it! > > > > > This leads to a few observations and ideas: > > - Different workloads want different policies. Being able to configure > > the policy per workload could be useful. > > - A workload that benefits from not being preempted itself could still > > benefit from preempting (low priority) background system tasks. > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving > these tasks; but assuming they're background and there's idle time in the > system that should be fine. > > https://lwn.net/Articles/805317/ > > That of course assuming you can classify these background tasks.. > > If you can do the classification, you can also use cpu.shares to reduce how > much cpu time they get. Or CFS bandwidth controller > > https://lwn.net/Articles/844976/ The problem with the cfs cgroup controller is that it gets quite expensive quickly with the increasing depth of the cgroup tree. This is why we had to disable it for some of our primary workloads. Still, being able to control latencies on a per-cgroup level is one of the goals of this patchset. > > I like Androd's model of classifying tasks. I think we need this classification > done by other non-android systems too. > > > - It would be useful to quickly (and safely) experiment with different > > policies in production, without having to shut down applications or reboot > > systems, to determine what the policies for different workloads should be. > > Userspace should have the knobs that allows them to tune that without reboot. > If you're doing kernel development; then it's part of the job spec I'd say :-) The problem here occurs because there is no comprehensive way to test any scheduler change other than running it on many machines (sometimes thousands) running different production-like workloads. If I'm able to test an idea by loading a bpf program (and btw have some sort of safety guarantees: maybe the performance will be hurt, but at least no panics), it can speed up the development process significantly. The alternative is way more complex from the infrastructure's point of view: releasing a custom kernel, testing it for safety, rebooting certain machines to it, pinning the kernel so it isn't automatically updated, etc.
> > I think one can still go with the workflow you suggest for development without > the hooks. You'd need to un-inline the function you're interested in; then you > can use kprobes to hook into it and force an early return. That should produce > the same effect, no? Basically, that's exactly what I'm suggesting. My patchset just provides a convenient way to define these hooks and some basic useful helper functions. > > > - Only a few workloads are large and sensitive enough to merit their own > > policy tweaks. CFS by itself should be good enough for everything else, > > and we probably do not want policy tweaks to be a replacement for anything > > CFS does. > > > > This leads to BPF hooks, which have been successfully used in various > > kernel subsystems to provide a way for external code to (safely) > > change a few kernel decisions. BPF tooling makes this pretty easy to do, > > and the people deploying BPF scripts are already quite used to updating them > > for new kernel versions. > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that > gets heavily modified by vendors and OEMs. We try very hard to understand the > problems they face and get the right set of solutions in mainline. Which would > ultimately help towards the goal of having a single Generic kernel Image [1] > that gives you what you'd expect out of the platform without any need for > additional cherries on top. Wouldn't it make your life easier if they provided a set of bpf programs instead of custom patches? > > So my worry is that this will open the gate for these hooks to get more than > just micro-optimization done in a platform specific way. And that it will > discourage having the right discussion to fix real problems in the scheduler > because the easy path is to do whatever you want in userspace. I am not sure we > can control how these hooks are used. I totally understand your worry. I think we need to find the right balance between allowing custom policies to be implemented and keeping the core functionality working well enough for everybody without a need to tweak anything. It seems like an alternative to this "let's allow cfs customization via bpf" approach is to completely move the scheduler code into userspace/bpf, something that Google's ghOSt is aiming to do. > > The question is: why can't we fix any issues in the scheduler/make it better > and must have these hooks instead? Of course, if it's possible to implement an idea in a form which is suitable for everybody and upstream it, this is the best outcome. The problem is that not every idea is like that. A bpf program can leverage a priori knowledge of a workload and its needs, something the generic scheduler code lacks by definition. Thanks!
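For reference, the kprobe-based workflow discussed above can be sketched roughly as follows. The target symbol is hypothetical (it assumes the function has been un-inlined and is visible to kprobes), and forcing an early return with bpf_override_return() additionally requires the function to be whitelisted with ALLOW_ERROR_INJECTION and a kernel built with CONFIG_BPF_KPROBE_OVERRIDE; this is a sketch of the alternative, not part of the posted series.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of the kprobe alternative: force the probed function to
 * return 0 early instead of making its usual decision. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/wakeup_preempt_entity")     /* hypothetical, un-inlined target */
int BPF_KPROBE(skip_wakeup_preemption)
{
        /* Requires error injection to be allowed for this symbol. */
        bpf_override_return(ctx, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";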
Hi Roman On 10/06/21 11:50, Roman Gushchin wrote: > On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote: > > Hi Roman > > > > On 09/16/21 09:24, Roman Gushchin wrote: > > > There is a long history of distro people, system administrators, and > > > application owners tuning the CFS settings in /proc/sys, which are now > > > in debugfs. Looking at what these settings actually did, it ended up > > > boiling down to changing the likelihood of task preemption, or > > > disabling it by setting the wakeup_granularity_ns to more than half of > > > the latency_ns. The other settings didn't really do much for > > > performance. > > > > > > In other words, some our workloads benefit by having long running tasks > > > preempted by tasks handling short running requests, and some workloads > > > that run only short term requests which benefit from never being preempted. > > > > We had discussion about introducing latency-nice hint; but that discussion > > didn't end up producing any new API. Your use case seem similar to Android's; > > we want some tasks to run ASAP. There's an out of tree patch that puts these > > tasks on an idle CPU (keep in mind energy aware scheduling in the context here) > > which seem okay for its purpose. Having a more generic solution in mainline > > would be nice. > > > > https://lwn.net/Articles/820659/ > > Hello Qais! > > Thank you for the link, I like it! > > > > > > > > > This leads to a few observations and ideas: > > > - Different workloads want different policies. Being able to configure > > > the policy per workload could be useful. > > > - A workload that benefits from not being preempted itself could still > > > benefit from preempting (low priority) background system tasks. > > > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving > > these tasks; but assuming they're background and there's idle time in the > > system that should be fine. > > > > https://lwn.net/Articles/805317/ > > > > That of course assuming you can classify these background tasks.. > > > > If you can do the classification, you can also use cpu.shares to reduce how > > much cpu time they get. Or CFS bandwidth controller > > > > https://lwn.net/Articles/844976/ > > The cfs cgroup controller is that it's getting quite expensive quickly with the > increasing depth of the cgroup tree. This is why we had to disable it for some > of our primary workloads. I can understand that.. > > Still being able to control latencies on per-cgroup level is one of the goals > of this patchset. > > > > > I like Androd's model of classifying tasks. I think we need this classification > > done by other non-android systems too. > > > > > - It would be useful to quickly (and safely) experiment with different > > > policies in production, without having to shut down applications or reboot > > > systems, to determine what the policies for different workloads should be. > > > > Userspace should have the knobs that allows them to tune that without reboot. > > If you're doing kernel development; then it's part of the job spec I'd say :-) > > The problem here occurs because there is no comprehensive way to test any > scheduler change rather than run it on many machines (sometimes 1000's) running > different production-alike workloads. > > If I'm able to test an idea by loading a bpf program (and btw have some sort of > safety guarantees: maybe the performance will be hurt, but at least no panics), > it can speed up the development process significantly. 
The alternative is way > more complex from the infrastructure's point of view: releasing a custom kernel, > test it for safety, reboot certain machines to it, pin the kernel from being > automatically updated etc. This process is unavoidable IMO. Assuming you have these hooks in, as soon as you require a new hook you'll be forced to have a custom kernel with that new hook introduced. Which, in my view, is no different from pushing a custom kernel that forces the function of interest to be noinline. Right? > > > > > I think one can still go with the workflow you suggest for development without > > the hooks. You'd need to un-inline the function you're interested in; then you > > can use kprobes to hook into it and force an early return. That should produce > > the same effect, no? > > Basically it's exactly what I'm suggesting. My patchset just provides a > convenient way to define these hooks and some basic useful helper functions. That convenience will only be true assuming you have a full, comprehensive list of hooks and never require adding a new one. As I highlighted above, this convenience is limited to the hooks you add now. Do people always want more hooks? Rhetorical question ;-) > > > > > > - Only a few workloads are large and sensitive enough to merit their own > > > policy tweaks. CFS by itself should be good enough for everything else, > > > and we probably do not want policy tweaks to be a replacement for anything > > > CFS does. > > > > > > This leads to BPF hooks, which have been successfully used in various > > > kernel subsystems to provide a way for external code to (safely) > > > change a few kernel decisions. BPF tooling makes this pretty easy to do, > > > and the people deploying BPF scripts are already quite used to updating them > > > for new kernel versions. > > > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that > > gets heavily modified by vendors and OEMs. We try very hard to understand the > > problems they face and get the right set of solutions in mainline. Which would > > ultimately help towards the goal of having a single Generic kernel Image [1] > > that gives you what you'd expect out of the platform without any need for > > additional cherries on top. > > Wouldn't it make your life easier had they provide a set of bpf programs instead > of custom patches? Not really. Having consistent mainline behavior is important, and these customizations contribute to fragmentation and can throw off userspace developers who find they have to do extra work on some platforms to get the desired outcome. They will be easy to misuse. We want to see the patches and find ways to improve the mainline kernel instead. That said, I can see the use case of being able to micro-optimize part of the scheduler in a workload-specific way. But then the way I see this support happening (DISCLAIMER, personal opinion :-)): 1. The hooks have to be about replacing a specific snippet, like Barry's example, where it's an area in which it is hard to find a generic solution that doesn't have a drawback for some class of workloads. 2. The set of bpf programs that modify it live in the kernel tree for each hook added. Then we can reason about why the hook is there and allow others to reap the benefit. Besides being able to easily re-evaluate whether users still need that hook after a potential improvement that could render it unnecessary. 3.
Out-of-tree bpf programs can only be loaded if a special CONFIG option is set, so that a production kernel can only load known ones that the community has reasoned about. 4. Out-of-tree bpf programs will taint the kernel. A regression reported with something funny loaded should be flagged as potentially bogus. IMHO this should tame the beast into something useful for addressing these situations where the change required to improve one workload will harm others and it's hard to come up with a good compromise. Then the hook as you suggest could help implement that policy specifically for that platform/workload. One can note that the behavior I suggest is similar to how modules work :) > > > > > So my worry is that this will open the gate for these hooks to get more than > > just micro-optimization done in a platform specific way. And that it will > > discourage having the right discussion to fix real problems in the scheduler > > because the easy path is to do whatever you want in userspace. I am not sure we > > can control how these hooks are used. > > I totally understand your worry. I think we need to find a right balance between > allowing to implement custom policies and keeping the core functionality > working well enough for everybody without a need to tweak anything. > > It seems like an alternative to this "let's allow cfs customization via bpf" > approach is to completely move the scheduler code into userspace/bpf, something > that Google's ghOSt is aiming to do. Why not ship a custom kernel instead then? > > > > > The question is: why can't we fix any issues in the scheduler/make it better > > and must have these hooks instead? > > Of course, if it's possible to implement an idea in a form which is suitable > for everybody and upstream it, this is the best outcome. The problem is that > not every idea is like that. A bpf program can leverage a priori knowledge > of a workload and its needs, something the generic scheduler code lacks > by the definition. Yep I see your point for certain aspects of the scheduler that are hard to tune universally. We just need to be careful not to end up in a wild west or Anything Can Happen Thursday situation :-) Maybe the maintainers have a different opinion though. Cheers -- Qais Yousef
On Mon, Oct 11, 2021 at 05:38:52PM +0100, Qais Yousef wrote: > Hi Roman > > On 10/06/21 11:50, Roman Gushchin wrote: > > On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote: > > > Hi Roman > > > > > > On 09/16/21 09:24, Roman Gushchin wrote: > > > > There is a long history of distro people, system administrators, and > > > > application owners tuning the CFS settings in /proc/sys, which are now > > > > in debugfs. Looking at what these settings actually did, it ended up > > > > boiling down to changing the likelihood of task preemption, or > > > > disabling it by setting the wakeup_granularity_ns to more than half of > > > > the latency_ns. The other settings didn't really do much for > > > > performance. > > > > > > > > In other words, some our workloads benefit by having long running tasks > > > > preempted by tasks handling short running requests, and some workloads > > > > that run only short term requests which benefit from never being preempted. > > > > > > We had discussion about introducing latency-nice hint; but that discussion > > > didn't end up producing any new API. Your use case seem similar to Android's; > > > we want some tasks to run ASAP. There's an out of tree patch that puts these > > > tasks on an idle CPU (keep in mind energy aware scheduling in the context here) > > > which seem okay for its purpose. Having a more generic solution in mainline > > > would be nice. > > > > > > https://lwn.net/Articles/820659/ > > > > Hello Qais! > > > > Thank you for the link, I like it! > > > > > > > > > > > > > This leads to a few observations and ideas: > > > > - Different workloads want different policies. Being able to configure > > > > the policy per workload could be useful. > > > > - A workload that benefits from not being preempted itself could still > > > > benefit from preempting (low priority) background system tasks. > > > > > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving > > > these tasks; but assuming they're background and there's idle time in the > > > system that should be fine. > > > > > > https://lwn.net/Articles/805317/ > > > > > > That of course assuming you can classify these background tasks.. > > > > > > If you can do the classification, you can also use cpu.shares to reduce how > > > much cpu time they get. Or CFS bandwidth controller > > > > > > https://lwn.net/Articles/844976/ > > > > The cfs cgroup controller is that it's getting quite expensive quickly with the > > increasing depth of the cgroup tree. This is why we had to disable it for some > > of our primary workloads. > > I can understand that.. > > > > > Still being able to control latencies on per-cgroup level is one of the goals > > of this patchset. > > > > > > > > I like Androd's model of classifying tasks. I think we need this classification > > > done by other non-android systems too. > > > > > > > - It would be useful to quickly (and safely) experiment with different > > > > policies in production, without having to shut down applications or reboot > > > > systems, to determine what the policies for different workloads should be. > > > > > > Userspace should have the knobs that allows them to tune that without reboot. > > > If you're doing kernel development; then it's part of the job spec I'd say :-) > > > > The problem here occurs because there is no comprehensive way to test any > > scheduler change rather than run it on many machines (sometimes 1000's) running > > different production-alike workloads. 
> > > > If I'm able to test an idea by loading a bpf program (and btw have some sort of > > safety guarantees: maybe the performance will be hurt, but at least no panics), > > it can speed up the development process significantly. The alternative is way > > more complex from the infrastructure's point of view: releasing a custom kernel, > > test it for safety, reboot certain machines to it, pin the kernel from being > > automatically updated etc. > > This process is unavoidable IMO. Assuming you have these hooks in; as soon as > you require a new hook you'll be forced to have a custom kernel with that new > hook introduced. Which, in my view, no different than pushing a custom kernel > that forces the function of interest to be noinline. Right? I think a relatively small and stable set of hooks can cover a large percent of potential customization ideas. > > > > > > > > > I think one can still go with the workflow you suggest for development without > > > the hooks. You'd need to un-inline the function you're interested in; then you > > > can use kprobes to hook into it and force an early return. That should produce > > > the same effect, no? > > > > Basically it's exactly what I'm suggesting. My patchset just provides a > > convenient way to define these hooks and some basic useful helper functions. > > Convenient will be only true assuming you have a full comprehensive list of > hooks to never require adding a new one. As I highlighted above, this > convenience is limited to hooks that you added now. > > Do people always want more hooks? Rhetorical question ;-) Why do you think that the list of the hooks will be so large/dynamic? I'm not saying we can figure it out from a first attempt, but I'm pretty sure that after some initial phase it can be relatively stable, e.g. changing only with some _major_ changes in the scheduler code. > > > > > > > > > > - Only a few workloads are large and sensitive enough to merit their own > > > > policy tweaks. CFS by itself should be good enough for everything else, > > > > and we probably do not want policy tweaks to be a replacement for anything > > > > CFS does. > > > > > > > > This leads to BPF hooks, which have been successfully used in various > > > > kernel subsystems to provide a way for external code to (safely) > > > > change a few kernel decisions. BPF tooling makes this pretty easy to do, > > > > and the people deploying BPF scripts are already quite used to updating them > > > > for new kernel versions. > > > > > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that > > > gets heavily modified by vendors and OEMs. We try very hard to understand the > > > problems they face and get the right set of solutions in mainline. Which would > > > ultimately help towards the goal of having a single Generic kernel Image [1] > > > that gives you what you'd expect out of the platform without any need for > > > additional cherries on top. > > > > Wouldn't it make your life easier had they provide a set of bpf programs instead > > of custom patches? > > Not really. > > Having consistent mainline behavior is important, and these customization > contribute to fragmentation and can throw off userspace developers who find > they have to do extra work on some platforms to get the desired outcome. They > will be easy to misuse. We want to see the patches and find ways to improve > mainline kernel instead. > > That said, I can see the use case of being able to micro-optimize part of the > scheduler in a workload specific way. 
But then the way I see this support > happening (DISCLAIMER, personal opinion :-)) > > 1. The hooks have to be about replacing specific snippet, like Barry's > example where it's an area that is hard to find a generic solution > that doesn't have a drawback over a class of workloads. This makes sense to me, and this is a good topic to discuss: which hooks do we really need. I don't think it necessarily has to replace something, but I totally agree on the "hard to find a generic solution" part. > > 2. The set of bpf programs that modify it live in the kernel tree for > each hook added. Then we can reason about why the hook is there and > allow others to reap the benefit. Beside being able to re-evaluate > easily if the users still need that hook after a potential > improvement that could render it unnecessary. > > 3. Out of tree bpf programs can only be loaded if special CONFIG option > is set so that production kernel can only load known ones that the > community knows and have reasoned about. > > 4. Out of tree bpf programs will taint the kernel. A regression > reported with something funny loaded should be flagged as > potentially bogus. 2-4 look like generic bpf questions to me; I don't think there is anything scheduler-specific. So I'd suggest bringing bpf maintainers into the discussion; their input can be very valuable. > > IMHO this should tame the beast to something useful to address these situations > where the change required to improve one workload will harm others and it's > hard to come up with a good compromise. Then the hook as you suggest could help > implement that policy specifically for that platform/workload. > > One can note that the behavior I suggest is similar to how modules work :) The important benefit of bpf is its safety guarantees. > > > > > > > > So my worry is that this will open the gate for these hooks to get more than > > > just micro-optimization done in a platform specific way. And that it will > > > discourage having the right discussion to fix real problems in the scheduler > > > because the easy path is to do whatever you want in userspace. I am not sure we > > > can control how these hooks are used. > > > > I totally understand your worry. I think we need to find a right balance between > > allowing to implement custom policies and keeping the core functionality > > working well enough for everybody without a need to tweak anything. > > > > It seems like an alternative to this "let's allow cfs customization via bpf" > > approach is to completely move the scheduler code into userspace/bpf, something > > that Google's ghOSt is aiming to do. > > Why not ship a custom kernel instead then? Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast. Just for example, imagine the process of rebooting 1,000,000 machines running thousands of different workloads, each with their own redundancy and capacity requirements. This is what makes the ability to push scheduler changes without a reboot/kernel upgrade so attractive. Obviously, that's not the case when we're talking about a single kernel engineer and their laptop/dev server/vm. > > > > > > > > The question is: why can't we fix any issues in the scheduler/make it better > > > and must have these hooks instead? > > > > Of course, if it's possible to implement an idea in a form which is suitable > > for everybody and upstream it, this is the best outcome. The problem is that > > not every idea is like that.
A bpf program can leverage a priori knowledge > > of a workload and its needs, something the generic scheduler code lacks > > by the definition. > > Yep I see your point for certain aspects of the scheduler that are hard to tune > universally. We just need to be careful not to end up in a wild west or Anything > Can Happen Thursday situation :-) Totally agree! Thanks!
On 10/11/21 11:09, Roman Gushchin wrote: > > Convenient will be only true assuming you have a full comprehensive list of > > hooks to never require adding a new one. As I highlighted above, this > > convenience is limited to hooks that you added now. > > > > Do people always want more hooks? Rhetorical question ;-) > > Why do you think that the list of the hooks will be so large/dynamic? It's not a fact. Just my thoughts/guess based on how things usually end up. It's very likely this will grow. I could be wrong of course :) > I'm not saying we can figure it out from a first attempt, but I'm pretty sure > that after some initial phase it can be relatively stable, e.g. changing only > with some _major_ changes in the scheduler code. My point was that the speed-up in workflow will be limited by what's available. It might be enough for a large set of use cases, as you say, but at some point there will be a new bottleneck that you might think is worth experimenting with, and the chances a suitable hook is available are 50:50 in theory. So it's not a magical fix where one would *never* have to push a custom kernel on all these systems to experiment with some scheduler changes. > > > > So my worry is that this will open the gate for these hooks to get more than > > > > just micro-optimization done in a platform specific way. And that it will > > > > discourage having the right discussion to fix real problems in the scheduler > > > > because the easy path is to do whatever you want in userspace. I am not sure we > > > > can control how these hooks are used. > > > > > > I totally understand your worry. I think we need to find a right balance between > > > allowing to implement custom policies and keeping the core functionality > > > working well enough for everybody without a need to tweak anything. > > > > > > It seems like an alternative to this "let's allow cfs customization via bpf" > > > approach is to completely move the scheduler code into userspace/bpf, something > > > that Google's ghOSt is aiming to do. > > > > Why not ship a custom kernel instead then? > > Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast. > Just for example, imagine a process of rebooting of a 1000000 machines running > 1000's different workloads, each with their own redundancy and capacity requirements. > > This what makes an ability to push scheduler changes without a reboot/kernel upgrade > so attractive. > > Obviously, it's not a case when we talk about a single kernel engineer and their > laptop/dev server/vm. I think you're still referring to ghOSt here. I thought your two use cases were different, as you mentioned they "completely move the scheduler code into userspace/bpf"; but it could be just me misinterpreting what this means. That didn't read to me as though they want to micro-optimize a few specific decisions in the scheduler, but rather replace it altogether, hence my question. Anyway. My 2 cents here is that we should be careful not to introduce something that encourages out-of-tree workarounds for real scheduler problems, nor have it done in a way where we lose visibility over how these hooks are used and the ability to share them with others who could benefit from the same micro-optimization too. Thanks! -- Qais Yousef
Hi Roman, Scheduler BPF is a great idea. Thanks for the work. Scheduler BPF won’t be a small feature, so I think we’d better give a summary of the possible hooks it may add first. We must have a *basic rule* to control what it will become, to avoid adding BPF hooks here and there. I haven’t found a clear rule yet, but maybe we can learn from netfilter, which has 5 basic hooks. Regarding the scheduler BPF hooks, some possible basic hooks may be: - Hook for Enqueue - Hook for Dequeue - Hook for Put Prev Task - Hook for Set Next Task > An example of an userspace part, which loads some simple hooks is available > here [3]. It's very simple, provided only to simplify playing with the provided > kernel patches. > It would be better to add this userspace code to samples/bpf/. [An error occurred in my mail client, so I'm resending it] -- Thanks Yafang
On Thu, Nov 25, 2021 at 02:00:04PM +0800, Yafang Shao wrote: > Hi Roman, Hi Yafang! > > Scheduler BPF is a great idea. > Thanks for the work. Thanks! > > Scheduler BPF won’t be a small feature, I think we’d better give a > summary of possible hooks it may add first. > We must have a *basic rule* to control what it will tend to be to > avoid adding BPF hooks here and there. > I haven’t found a clear rule yet, but maybe we can learn it from > netfilter, which has 5 basic hooks. > Regarding the scheduler BPF hooks, some possible basic hooks may be: > - Hook for Enqueue > - Hook for Dequeue > - Hook for Put Prev Task > - Hook for Set Next Task I think it depends on what we want to achieve. There are several options: we might aim to implement the whole scheduler logic in bpf, we might aim to do some adjustments to the existing scheduler behavior, or a mix of those approaches. Bpf as it is now is not capable enough to implement a new scheduler class without a substantial amount of new C code (in the form of helpers, maybe custom maps, some verifier changes, etc.). In particular, it's challenging to provide strong safety guarantees: any scheduler bpf program loaded shouldn't crash or deadlock the system (otherwise bpf isn't any better than a kernel module). Also, performance margins are quite tight. I'm not saying that providing such generic hooks is impossible or useless, but it requires a lot of changes and support code and I'm not sure that we have a good justification for them right now. I think instead we might want to see bpf hooks as a better form of (sysctl) tunables, which are more flexible (e.g. they can be used for specific processes, cgroups, cpus, or be enabled depending on load, weather, etc.) and do not create an ABI (so are easier to maintain). > > > > An example of an userspace part, which loads some simple hooks is available > > here [3]. It's very simple, provided only to simplify playing with the provided > > kernel patches. > > > > You’d better add this userspace code into samples/bpf/. I thought samples/bpf was considered deprecated (in favor of selftests/bpf/), but I'm gonna check with bpf maintainers. Thanks for the idea!
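To make the "more flexible form of a tunable" idea concrete, here is a rough sketch of a hook program that suppresses wakeup preemption only for tasks in one cgroup. The hook section name, the return-value convention and the availability of bpf_get_current_cgroup_id() in this program type are all assumptions layered on top of the RFC, not its actual ABI.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of a "per-cgroup tunable". Hook name, return convention and
 * helper availability are assumptions, not part of the posted series. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

const volatile __u64 quiet_cgrp_id;     /* set by the loader before load */

SEC("sched/cfs_check_preempt_wakeup")   /* hypothetical hook name */
int no_wakeup_preemption_in_quiet_cgroup(void *ctx)
{
        /* Assumed convention: negative return suppresses preemption,
         * zero keeps the default CFS decision. */
        if (bpf_get_current_cgroup_id() == quiet_cgrp_id)
                return -1;
        return 0;
}

char LICENSE[] SEC("license") = "GPL";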
Hi Roman and the list, I have a naive question regarding the BPF hooks for sched. Given that BPF can also be attached to tracepoints, why do we add a BPF prog type specific to sched? The reason I can come up with is that sched BPF can have a retval to drive the scheduling decision in a static branch, whereas a tracepoint is not able to do this. Is it mainly because of this or anything else? Thanks Huichun
On Sat, Jan 15, 2022 at 04:29:24PM +0800, Huichun Feng wrote: > Hi Roman and the list, Hello Huichun! > > I have a naive question regarding BPF hook for sched. > > Given that BPF can also be attached to tracepoint, why do we add a BPF prog > type specific to sched? Tracing programs can have return values as well, see kretprobes. > > The reason I can come up with is that sched BPF can have retval to drive the > scheduling decision in static branch, whereas tracepoint is not able to do this. > Is it mainly because of this or anything else? Well, you are right that right now there is no strict necessity to introduce a new prog type (aside from the static branch mechanism you mentioned), however I believe it's useful in the long run. Sched programs might be able to use a different set of helpers, maybe there will be some additional restrictions, etc. It's an RFC version of the patchset and any ideas, suggestions and criticism are highly welcome! Thanks!
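For context on the kretprobe remark above, a tracing program can observe a function's return value as sketched below (built with the usual libbpf flags, e.g. -D__TARGET_ARCH_x86). The target symbol is only an example, and the value is observed for tracing; it is not fed back into the scheduling decision.

// SPDX-License-Identifier: GPL-2.0
/* Example of reading a return value with a kretprobe program. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kretprobe/pick_next_task_fair")
int BPF_KRETPROBE(log_pick, void *ret)
{
        /* Log whether the fair class picked a task on this invocation. */
        bpf_printk("pick_next_task_fair picked a task: %d", ret ? 1 : 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";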
Hi Roman and list, We want to implement a programmable scheduler to meet the scheduling requirements of different workloads. Using BPF, we can easily deploy scheduling policies for specific workloads and verify them quickly without modifying the kernel code. This greatly reduces the cost of deploying new scheduling policies in the production environment. Therefore, we want to continue to develop based on your patch. We plan to merge it into the openEuler open-source community and use the community to continuously evolve and maintain it. (link: https://www.openeuler.org/en/) We made some changes to your patch: 1. Adapt it to the openeuler-OLK-5.10 branch, which is mostly based on the Linux longterm branch 5.10. 2. Introduce the Kconfig option CONFIG_BPF_SCHED to isolate the related code at compile time. 3. The helpers bpf_sched_entity_to_cgrpid() and bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group to which the sched entity belongs through se->my_q->tg->css.cgroup. We have some ideas for the next iteration of Scheduler BPF that we would like to share with you: 1. A tag field is added to struct task_struct and struct task_group. Users can use the file system interface to mark different tags for specific workloads. The bpf prog obtains the tags to detect different workloads. 2. Add BPF hooks and helpers to scheduling paths such as select_task_rq and pick_next_task to enable extensibility. It's a new attempt, and there are bound to be a lot of problems later, but it's exciting that it makes the scheduler programmable. cheers, Ren Zhijie
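As a rough kernel-side illustration of the lookup Ren describes: the se->my_q->tg->css.cgroup chain is taken from the message, while the function name, the NULL checks and the use of cgroup_id() are assumptions rather than the actual openEuler helper implementation. A snippet like this would live next to the scheduler code, where struct sched_entity, struct cfs_rq and struct task_group are visible.

/* Sketch: resolve the cgroup of a group sched entity. Only group
 * entities have a non-NULL my_q; task entities would need a different
 * path. Details are assumptions, not the openEuler implementation. */
static u64 sched_entity_cgrpid(const struct sched_entity *se)
{
        if (se->my_q && se->my_q->tg && se->my_q->tg->css.cgroup)
                return cgroup_id(se->my_q->tg->css.cgroup);
        return 0;
}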
On Tue, Jul 19, 2022 at 09:17:24PM +0800, Ren Zhijie wrote: > Hi Roman and list, > > We want to implement a programmable scheduler to meet the schedule > requirements of different workloads. > > Using BPF, we can easily deploy schedule policies for specific workloads, > quickly verifying without modifying the kernel code. This greatly reduces > the cost of deploying new schedule policies in the production environment. > > Therefore, we want to continue to develop based on your patch. We plan to > merge it into the openeuler open-source community and use the community to > continuously evolve and maintain it. > (link: https://www.openeuler.org/en/) > > We made some changes to your patch: > 1. Adapt to the openeuler-OLK-5.10 branch, which mostly base on linux > longterm branch 5.10. > 2. Introduce the Kconfig CONFIG_BPF_SCHED to isolate related code at compile > time. > 3. helpers bpf_sched_entity_to_cgrpid() and > bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group to > which the sched entity belongs through se->my_q->tg->css.cgroup. > > We have some ideas for the next iteration of Scheduler BPF that we would > like to share with you: > 1.The tag field is added to struct task_struct and struct task_group. Users > can use the file system interface to mark different tags for specific > workloads. The bpf prog obtains the tags to detect different workloads. > 2.Add BPF hook and helper to scheduling processes such as select_task_rq and > pick_next_task to enable scalability. > > It's a new attempt, and there's bound to be a lot of problems later, but > it's exciting that it makes the schduler programmable. Hi Ren! Great to hear my work is useful and thank you for describing your plans! I'm not actively working on it right now, but I might start again in the future. Let me know if I can help you with this effort. Thanks!