[rfc,0/6] Scheduler BPF

Message ID 20210916162451.709260-1-guro@fb.com (mailing list archive)

Message

Roman Gushchin Sept. 16, 2021, 4:24 p.m. UTC
There is a long history of distro people, system administrators, and
application owners tuning the CFS settings in /proc/sys, which are now
in debugfs. Looking at what these settings actually did, it ended up
boiling down to changing the likelihood of task preemption, or
disabling it by setting the wakeup_granularity_ns to more than half of
the latency_ns. The other settings didn't really do much for
performance.

In other words, some of our workloads benefit from having long running tasks
preempted by tasks handling short running requests, while other workloads,
which run only short requests, benefit from never being preempted.

This leads to a few observations and ideas:
- Different workloads want different policies. Being able to configure
  the policy per workload could be useful.
- A workload that benefits from not being preempted itself could still
  benefit from preempting (low priority) background system tasks.
- It would be useful to quickly (and safely) experiment with different
  policies in production, without having to shut down applications or reboot
  systems, to determine what the policies for different workloads should be.
- Only a few workloads are large and sensitive enough to merit their own
  policy tweaks. CFS by itself should be good enough for everything else,
  and we probably do not want policy tweaks to be a replacement for anything
  CFS does.

This leads to BPF hooks, which have been successfully used in various
kernel subsystems to provide a way for external code to (safely)
change a few kernel decisions. BPF tooling makes this pretty easy to do,
and the people deploying BPF scripts are already quite used to updating them
for new kernel versions.

This patchset aims to start a discussion about potential applications of BPF
to the scheduler. It also aims to land some very basic BPF infrastructure
necessary to add new BPF hooks to the scheduler, a minimal set of useful
helpers, corresponding libbpf changes, etc.
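
To make the discussion more concrete, below is a minimal sketch of what a
scheduler BPF program could look like on top of this series. The section and
hook names are illustrative assumptions based on the patch titles (the
authoritative list lives in include/linux/sched_hook_defs.h), as is the
return-value convention:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* Hook/section name is illustrative; see include/linux/sched_hook_defs.h
   * in the patches for the real list. */
  SEC("sched/cfs_check_preempt_wakeup")
  int BPF_PROG(wakeup_policy, struct task_struct *curr, struct task_struct *p)
  {
          /* Assumed convention: < 0 suppresses the wakeup preemption,
           * > 0 forces it, 0 keeps the default CFS decision. */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";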

Our very first experiments with using BPF in CFS look very promising. We're
at a very early stage, but we have already seen nice latency and ~1% RPS
wins for our (Facebook's) main web workload.

As far as I know, Google is working on a more radical approach [2]: they aim
to move the scheduling code into userspace. It seems that their core motivation
is somewhat similar: to make scheduler changes easier to develop, validate
and deploy. Even though their approach is different, they also use BPF to
speed up some hot paths. I think the suggested infrastructure can serve
their purpose too.

An example of a userspace part, which loads some simple hooks, is available
here [3]. It's very simple, provided only to make it easy to play with the
provided kernel patches.
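
For reference, the userspace side can be a few lines of libbpf skeleton code.
The sketch below assumes the BPF object is built as sched.bpf.o and a skeleton
was generated with "bpftool gen skeleton"; the sched_bpf names come from that
assumption and are not part of the patchset:

  #include <stdio.h>
  #include <unistd.h>
  #include "sched.skel.h" /* generated: bpftool gen skeleton sched.bpf.o */

  int main(void)
  {
          struct sched_bpf *skel;

          skel = sched_bpf__open_and_load();
          if (!skel) {
                  fprintf(stderr, "failed to open/load sched bpf object\n");
                  return 1;
          }
          if (sched_bpf__attach(skel)) {
                  fprintf(stderr, "failed to attach sched bpf programs\n");
                  sched_bpf__destroy(skel);
                  return 1;
          }
          /* Keep the hooks attached until the process is killed. */
          pause();
          sched_bpf__destroy(skel);
          return 0;
  }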


[1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
[2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
[3] https://github.com/rgushchin/atc


Roman Gushchin (6):
  bpf: sched: basic infrastructure for scheduler bpf
  bpf: sched: add convenient helpers to identify sched entities
  bpf: sched: introduce bpf_sched_enable()
  sched: cfs: add bpf hooks to control wakeup and tick preemption
  libbpf: add support for scheduler bpf programs
  bpftool: recognize scheduler programs

 include/linux/bpf_sched.h       |  53 ++++++++++++
 include/linux/bpf_types.h       |   3 +
 include/linux/sched_hook_defs.h |   4 +
 include/uapi/linux/bpf.h        |  25 ++++++
 kernel/bpf/btf.c                |   1 +
 kernel/bpf/syscall.c            |  21 ++++-
 kernel/bpf/trampoline.c         |   1 +
 kernel/bpf/verifier.c           |   9 ++-
 kernel/sched/Makefile           |   1 +
 kernel/sched/bpf_sched.c        | 138 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c             |  27 +++++++
 scripts/bpf_doc.py              |   2 +
 tools/bpf/bpftool/common.c      |   1 +
 tools/bpf/bpftool/prog.c        |   1 +
 tools/include/uapi/linux/bpf.h  |  25 ++++++
 tools/lib/bpf/libbpf.c          |  27 ++++++-
 tools/lib/bpf/libbpf.h          |   4 +
 tools/lib/bpf/libbpf.map        |   3 +
 18 files changed, 341 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/bpf_sched.h
 create mode 100644 include/linux/sched_hook_defs.h
 create mode 100644 kernel/sched/bpf_sched.c

Comments

Roman Gushchin Sept. 16, 2021, 4:36 p.m. UTC | #1
Hello!

I'm sorry, somehow the patchset didn't reach the mailing lists and at least
some recipients yesterday (I'm digging into why).

Resending with fewer people in the cc list, which was probably the reason.

Thanks!

On Thu, Sep 16, 2021 at 09:24:45AM -0700, Roman Gushchin wrote:
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
>
Qais Yousef Oct. 6, 2021, 4:39 p.m. UTC | #2
Hi Roman

On 09/16/21 09:24, Roman Gushchin wrote:
> There is a long history of distro people, system administrators, and
> application owners tuning the CFS settings in /proc/sys, which are now
> in debugfs. Looking at what these settings actually did, it ended up
> boiling down to changing the likelihood of task preemption, or
> disabling it by setting the wakeup_granularity_ns to more than half of
> the latency_ns. The other settings didn't really do much for
> performance.
> 
> In other words, some our workloads benefit by having long running tasks
> preempted by tasks handling short running requests, and some workloads
> that run only short term requests which benefit from never being preempted.

We had a discussion about introducing a latency-nice hint, but that discussion
didn't end up producing any new API. Your use case seems similar to Android's;
we want some tasks to run ASAP. There's an out-of-tree patch that puts these
tasks on an idle CPU (keep in mind energy-aware scheduling in this context),
which seems okay for its purpose. Having a more generic solution in mainline
would be nice.

https://lwn.net/Articles/820659/

> 
> This leads to a few observations and ideas:
> - Different workloads want different policies. Being able to configure
>   the policy per workload could be useful.
> - A workload that benefits from not being preempted itself could still
>   benefit from preempting (low priority) background system tasks.

You can make these tasks SCHED_IDLE. There's a potential danger of starving
these tasks, but assuming they're background and there's idle time in the
system, that should be fine.

https://lwn.net/Articles/805317/

That of course assuming you can classify these background tasks..

If you can do the classification, you can also use cpu.shares to reduce how
much cpu time they get. Or CFS bandwidth controller

https://lwn.net/Articles/844976/
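
For completeness, classifying an already running background task as SCHED_IDLE
needs nothing more than a sched_setscheduler() call. A minimal sketch (the pid
comes from whatever classification mechanism you have):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          struct sched_param param = { .sched_priority = 0 };
          pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : 0;

          /* SCHED_IDLE tasks only run when nothing else wants the CPU. */
          if (sched_setscheduler(pid, SCHED_IDLE, &param)) {
                  perror("sched_setscheduler");
                  return 1;
          }
          return 0;
  }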

I like Android's model of classifying tasks. I think we need this classification
done by other non-Android systems too.

> - It would be useful to quickly (and safely) experiment with different
>   policies in production, without having to shut down applications or reboot
>   systems, to determine what the policies for different workloads should be.

Userspace should have the knobs that allow it to tune that without a reboot.
If you're doing kernel development, then it's part of the job spec, I'd say :-)

I think one can still go with the workflow you suggest for development without
the hooks. You'd need to un-inline the function you're interested in; then you
can use kprobes to hook into it and force an early return. That should produce
the same effect, no?
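
Roughly, a sketch of that kprobe approach is below. Note that forcing an early
return from BPF relies on bpf_override_return(), which only works on functions
allow-listed with ALLOW_ERROR_INJECTION and with CONFIG_BPF_KPROBE_OVERRIDE
enabled; the function name is a placeholder:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* "my_uninlined_sched_func" is a placeholder for the un-inlined,
   * error-injection-whitelisted function of interest. */
  SEC("kprobe/my_uninlined_sched_func")
  int BPF_KPROBE(force_early_return)
  {
          /* Skip the function body and make it return 0 instead. */
          bpf_override_return(ctx, 0);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";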

> - Only a few workloads are large and sensitive enough to merit their own
>   policy tweaks. CFS by itself should be good enough for everything else,
>   and we probably do not want policy tweaks to be a replacement for anything
>   CFS does.
> 
> This leads to BPF hooks, which have been successfully used in various
> kernel subsystems to provide a way for external code to (safely)
> change a few kernel decisions. BPF tooling makes this pretty easy to do,
> and the people deploying BPF scripts are already quite used to updating them
> for new kernel versions.

I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
gets heavily modified by vendors and OEMs. We try very hard to understand the
problems they face and get the right set of solutions in mainline. Which would
ultimately help towards the goal of having a single Generic kernel Image [1]
that gives you what you'd expect out of the platform without any need for
additional cherries on top.

So my worry is that this will open the gate for these hooks to get more than
just micro-optimization done in a platform specific way. And that it will
discourage having the right discussion to fix real problems in the scheduler
because the easy path is to do whatever you want in userspace. I am not sure we
can control how these hooks are used.

The question is: why can't we fix any issues in the scheduler / make it better,
rather than having these hooks?

[1] https://arstechnica.com/gadgets/2021/09/android-to-take-an-upstream-first-development-model-for-the-linux-kernel/

Thanks

--
Qais Yousef

> 
> This patchset aims to start a discussion about potential applications of BPF
> to the scheduler. It also aims to land some very basic BPF infrastructure
> necessary to add new BPF hooks to the scheduler, a minimal set of useful
> helpers, corresponding libbpf changes, etc.
> 
> Our very first experiments with using BPF in CFS look very promising. We're
> at a very early stage, however already have seen a nice latency and ~1% RPS
> wins for our (Facebook's) main web workload.
> 
> As I know, Google is working on a more radical approach [2]: they aim to move
> the scheduling code into userspace. It seems that their core motivation is
> somewhat similar: to make the scheduler changes easier to develop, validate
> and deploy. Even though their approach is different, they also use BPF for
> speeding up some hot paths. I think the suggested infrastructure can serve
> their purpose too.
> 
> An example of an userspace part, which loads some simple hooks is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
> 
> 
> [1] c722f35b513f ("sched/fair: Bring back select_idle_smt(), but differently")
> [2] Google's ghOSt: https://linuxplumbersconf.org/event/11/contributions/954/
> [3] https://github.com/rgushchin/atc
> 
> 
> Roman Gushchin (6):
>   bpf: sched: basic infrastructure for scheduler bpf
>   bpf: sched: add convenient helpers to identify sched entities
>   bpf: sched: introduce bpf_sched_enable()
>   sched: cfs: add bpf hooks to control wakeup and tick preemption
>   libbpf: add support for scheduler bpf programs
>   bpftool: recognize scheduler programs
> 
>  include/linux/bpf_sched.h       |  53 ++++++++++++
>  include/linux/bpf_types.h       |   3 +
>  include/linux/sched_hook_defs.h |   4 +
>  include/uapi/linux/bpf.h        |  25 ++++++
>  kernel/bpf/btf.c                |   1 +
>  kernel/bpf/syscall.c            |  21 ++++-
>  kernel/bpf/trampoline.c         |   1 +
>  kernel/bpf/verifier.c           |   9 ++-
>  kernel/sched/Makefile           |   1 +
>  kernel/sched/bpf_sched.c        | 138 ++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c             |  27 +++++++
>  scripts/bpf_doc.py              |   2 +
>  tools/bpf/bpftool/common.c      |   1 +
>  tools/bpf/bpftool/prog.c        |   1 +
>  tools/include/uapi/linux/bpf.h  |  25 ++++++
>  tools/lib/bpf/libbpf.c          |  27 ++++++-
>  tools/lib/bpf/libbpf.h          |   4 +
>  tools/lib/bpf/libbpf.map        |   3 +
>  18 files changed, 341 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/bpf_sched.h
>  create mode 100644 include/linux/sched_hook_defs.h
>  create mode 100644 kernel/sched/bpf_sched.c
> 
> -- 
> 2.31.1
>
Roman Gushchin Oct. 6, 2021, 6:50 p.m. UTC | #3
On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> Hi Roman
> 
> On 09/16/21 09:24, Roman Gushchin wrote:
> > There is a long history of distro people, system administrators, and
> > application owners tuning the CFS settings in /proc/sys, which are now
> > in debugfs. Looking at what these settings actually did, it ended up
> > boiling down to changing the likelihood of task preemption, or
> > disabling it by setting the wakeup_granularity_ns to more than half of
> > the latency_ns. The other settings didn't really do much for
> > performance.
> > 
> > In other words, some our workloads benefit by having long running tasks
> > preempted by tasks handling short running requests, and some workloads
> > that run only short term requests which benefit from never being preempted.
> 
> We had discussion about introducing latency-nice hint; but that discussion
> didn't end up producing any new API. Your use case seem similar to Android's;
> we want some tasks to run ASAP. There's an out of tree patch that puts these
> tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> which seem okay for its purpose. Having a more generic solution in mainline
> would be nice.
> 
> https://lwn.net/Articles/820659/

Hello Qais!

Thank you for the link, I like it!

> 
> > 
> > This leads to a few observations and ideas:
> > - Different workloads want different policies. Being able to configure
> >   the policy per workload could be useful.
> > - A workload that benefits from not being preempted itself could still
> >   benefit from preempting (low priority) background system tasks.
> 
> You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> these tasks; but assuming they're background and there's idle time in the
> system that should be fine.
> 
> https://lwn.net/Articles/805317/
> 
> That of course assuming you can classify these background tasks..
> 
> If you can do the classification, you can also use cpu.shares to reduce how
> much cpu time they get. Or CFS bandwidth controller
> 
> https://lwn.net/Articles/844976/

The problem with the cfs cgroup controller is that it gets quite expensive
quickly with the increasing depth of the cgroup tree. This is why we had to
disable it for some of our primary workloads.

Still, being able to control latencies on a per-cgroup level is one of the goals
of this patchset.

> 
> I like Androd's model of classifying tasks. I think we need this classification
> done by other non-android systems too.
> 
> > - It would be useful to quickly (and safely) experiment with different
> >   policies in production, without having to shut down applications or reboot
> >   systems, to determine what the policies for different workloads should be.
> 
> Userspace should have the knobs that allows them to tune that without reboot.
> If you're doing kernel development; then it's part of the job spec I'd say :-)

The problem here occurs because there is no comprehensive way to test a
scheduler change other than running it on many machines (sometimes 1000s)
running different production-like workloads.

If I'm able to test an idea by loading a bpf program (and btw have some sort of
safety guarantees: maybe the performance will be hurt, but at least no panics),
it can speed up the development process significantly. The alternative is way
more complex from the infrastructure's point of view: releasing a custom kernel,
testing it for safety, rebooting certain machines to it, pinning the kernel from
being automatically updated, etc.

> 
> I think one can still go with the workflow you suggest for development without
> the hooks. You'd need to un-inline the function you're interested in; then you
> can use kprobes to hook into it and force an early return. That should produce
> the same effect, no?

Basically, that's exactly what I'm suggesting. My patchset just provides a
convenient way to define these hooks and some basic useful helper functions.

> 
> > - Only a few workloads are large and sensitive enough to merit their own
> >   policy tweaks. CFS by itself should be good enough for everything else,
> >   and we probably do not want policy tweaks to be a replacement for anything
> >   CFS does.
> > 
> > This leads to BPF hooks, which have been successfully used in various
> > kernel subsystems to provide a way for external code to (safely)
> > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > and the people deploying BPF scripts are already quite used to updating them
> > for new kernel versions.
> 
> I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> gets heavily modified by vendors and OEMs. We try very hard to understand the
> problems they face and get the right set of solutions in mainline. Which would
> ultimately help towards the goal of having a single Generic kernel Image [1]
> that gives you what you'd expect out of the platform without any need for
> additional cherries on top.

Wouldn't it make your life easier if they provided a set of bpf programs
instead of custom patches?

> 
> So my worry is that this will open the gate for these hooks to get more than
> just micro-optimization done in a platform specific way. And that it will
> discourage having the right discussion to fix real problems in the scheduler
> because the easy path is to do whatever you want in userspace. I am not sure we
> can control how these hooks are used.

I totally understand your worry. I think we need to find the right balance
between allowing custom policies to be implemented and keeping the core
functionality working well enough for everybody without a need to tweak
anything.

It seems like an alternative to this "let's allow cfs customization via bpf"
approach is to completely move the scheduler code into userspace/bpf, something
that Google's ghOSt is aiming to do.

> 
> The question is: why can't we fix any issues in the scheduler/make it better
> and must have these hooks instead?

Of course, if it's possible to implement an idea in a form which is suitable
for everybody and upstream it, this is the best outcome. The problem is that
not every idea is like that. A bpf program can leverage a priori knowledge
of a workload and its needs, something the generic scheduler code lacks
by definition.

Thanks!
Qais Yousef Oct. 11, 2021, 4:38 p.m. UTC | #4
Hi Roman

On 10/06/21 11:50, Roman Gushchin wrote:
> On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > Hi Roman
> > 
> > On 09/16/21 09:24, Roman Gushchin wrote:
> > > There is a long history of distro people, system administrators, and
> > > application owners tuning the CFS settings in /proc/sys, which are now
> > > in debugfs. Looking at what these settings actually did, it ended up
> > > boiling down to changing the likelihood of task preemption, or
> > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > the latency_ns. The other settings didn't really do much for
> > > performance.
> > > 
> > > In other words, some our workloads benefit by having long running tasks
> > > preempted by tasks handling short running requests, and some workloads
> > > that run only short term requests which benefit from never being preempted.
> > 
> > We had discussion about introducing latency-nice hint; but that discussion
> > didn't end up producing any new API. Your use case seem similar to Android's;
> > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > which seem okay for its purpose. Having a more generic solution in mainline
> > would be nice.
> > 
> > https://lwn.net/Articles/820659/
> 
> Hello Qais!
> 
> Thank you for the link, I like it!
> 
> > 
> > > 
> > > This leads to a few observations and ideas:
> > > - Different workloads want different policies. Being able to configure
> > >   the policy per workload could be useful.
> > > - A workload that benefits from not being preempted itself could still
> > >   benefit from preempting (low priority) background system tasks.
> > 
> > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > these tasks; but assuming they're background and there's idle time in the
> > system that should be fine.
> > 
> > https://lwn.net/Articles/805317/
> > 
> > That of course assuming you can classify these background tasks..
> > 
> > If you can do the classification, you can also use cpu.shares to reduce how
> > much cpu time they get. Or CFS bandwidth controller
> > 
> > https://lwn.net/Articles/844976/
> 
> The cfs cgroup controller is that it's getting quite expensive quickly with the
> increasing depth of the cgroup tree. This is why we had to disable it for some
> of our primary workloads.

I can understand that..

> 
> Still being able to control latencies on per-cgroup level is one of the goals
> of this patchset.
> 
> > 
> > I like Androd's model of classifying tasks. I think we need this classification
> > done by other non-android systems too.
> > 
> > > - It would be useful to quickly (and safely) experiment with different
> > >   policies in production, without having to shut down applications or reboot
> > >   systems, to determine what the policies for different workloads should be.
> > 
> > Userspace should have the knobs that allows them to tune that without reboot.
> > If you're doing kernel development; then it's part of the job spec I'd say :-)
> 
> The problem here occurs because there is no comprehensive way to test any
> scheduler change rather than run it on many machines (sometimes 1000's) running
> different production-alike workloads.
> 
> If I'm able to test an idea by loading a bpf program (and btw have some sort of
> safety guarantees: maybe the performance will be hurt, but at least no panics),
> it can speed up the development process significantly. The alternative is way
> more complex from the infrastructure's point of view: releasing a custom kernel,
> test it for safety, reboot certain machines to it, pin the kernel from being
> automatically updated etc.

This process is unavoidable IMO. Assuming you have these hooks in, as soon as
you require a new hook you'll be forced to have a custom kernel with that new
hook introduced. Which, in my view, is no different from pushing a custom kernel
that forces the function of interest to be noinline. Right?

> 
> > 
> > I think one can still go with the workflow you suggest for development without
> > the hooks. You'd need to un-inline the function you're interested in; then you
> > can use kprobes to hook into it and force an early return. That should produce
> > the same effect, no?
> 
> Basically it's exactly what I'm suggesting. My patchset just provides a
> convenient way to define these hooks and some basic useful helper functions.

Convenience will only hold assuming you have a comprehensive enough list of
hooks to never require adding a new one. As I highlighted above, this
convenience is limited to the hooks that you add now.

Do people always want more hooks? Rhetorical question ;-)

> 
> > 
> > > - Only a few workloads are large and sensitive enough to merit their own
> > >   policy tweaks. CFS by itself should be good enough for everything else,
> > >   and we probably do not want policy tweaks to be a replacement for anything
> > >   CFS does.
> > > 
> > > This leads to BPF hooks, which have been successfully used in various
> > > kernel subsystems to provide a way for external code to (safely)
> > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > and the people deploying BPF scripts are already quite used to updating them
> > > for new kernel versions.
> > 
> > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > problems they face and get the right set of solutions in mainline. Which would
> > ultimately help towards the goal of having a single Generic kernel Image [1]
> > that gives you what you'd expect out of the platform without any need for
> > additional cherries on top.
> 
> Wouldn't it make your life easier had they provide a set of bpf programs instead
> of custom patches?

Not really.

Having consistent mainline behavior is important, and these customizations
contribute to fragmentation and can throw off userspace developers who find
they have to do extra work on some platforms to get the desired outcome. They
will be easy to misuse. We want to see the patches and find ways to improve
the mainline kernel instead.

That said, I can see the use case of being able to micro-optimize part of the
scheduler in a workload specific way. But then the way I see this support
happening (DISCLAIMER, personal opinion :-))

	1. The hooks have to be about replacing a specific snippet, like Barry's
	   example, where it's an area in which it is hard to find a generic
	   solution that doesn't have a drawback for some class of workloads.

	2. The set of bpf programs that modify it live in the kernel tree for
	   each hook added. Then we can reason about why the hook is there and
	   allow others to reap the benefit. Beside being able to re-evaluate
	   easily if the users still need that hook after a potential
	   improvement that could render it unnecessary.

	3. Out of tree bpf programs can only be loaded if a special CONFIG option
	   is set, so that a production kernel can only load known ones that the
	   community knows and has reasoned about.

	4. Out of tree bpf programs will taint the kernel. A regression
	   reported with something funny loaded should be flagged as
	   potentially bogus.

IMHO this should tame the beast into something useful for addressing the
situations where the change required to improve one workload will harm others
and it's hard to come up with a good compromise. Then the hook, as you suggest,
could help implement that policy specifically for that platform/workload.

One can note that the behavior I suggest is similar to how modules work :)

> 
> > 
> > So my worry is that this will open the gate for these hooks to get more than
> > just micro-optimization done in a platform specific way. And that it will
> > discourage having the right discussion to fix real problems in the scheduler
> > because the easy path is to do whatever you want in userspace. I am not sure we
> > can control how these hooks are used.
> 
> I totally understand your worry. I think we need to find a right balance between
> allowing to implement custom policies and keeping the core functionality
> working well enough for everybody without a need to tweak anything.
> 
> It seems like an alternative to this "let's allow cfs customization via bpf"
> approach is to completely move the scheduler code into userspace/bpf, something
> that Google's ghOSt is aiming to do.

Why not ship a custom kernel instead then?

> 
> > 
> > The question is: why can't we fix any issues in the scheduler/make it better
> > and must have these hooks instead?
> 
> Of course, if it's possible to implement an idea in a form which is suitable
> for everybody and upstream it, this is the best outcome. The problem is that
> not every idea is like that. A bpf program can leverage a priori knowledge
> of a workload and its needs, something the generic scheduler code lacks
> by the definition.

Yep I see your point for certain aspects of the scheduler that are hard to tune
universally. We just need to be careful not to end up in a wild west or Anything
Can Happen Thursday situation :-)

Maybe the maintainers have a different opinion though.

Cheers

--
Qais Yousef
Roman Gushchin Oct. 11, 2021, 6:09 p.m. UTC | #5
On Mon, Oct 11, 2021 at 05:38:52PM +0100, Qais Yousef wrote:
> Hi Roman
> 
> On 10/06/21 11:50, Roman Gushchin wrote:
> > On Wed, Oct 06, 2021 at 05:39:49PM +0100, Qais Yousef wrote:
> > > Hi Roman
> > > 
> > > On 09/16/21 09:24, Roman Gushchin wrote:
> > > > There is a long history of distro people, system administrators, and
> > > > application owners tuning the CFS settings in /proc/sys, which are now
> > > > in debugfs. Looking at what these settings actually did, it ended up
> > > > boiling down to changing the likelihood of task preemption, or
> > > > disabling it by setting the wakeup_granularity_ns to more than half of
> > > > the latency_ns. The other settings didn't really do much for
> > > > performance.
> > > > 
> > > > In other words, some our workloads benefit by having long running tasks
> > > > preempted by tasks handling short running requests, and some workloads
> > > > that run only short term requests which benefit from never being preempted.
> > > 
> > > We had discussion about introducing latency-nice hint; but that discussion
> > > didn't end up producing any new API. Your use case seem similar to Android's;
> > > we want some tasks to run ASAP. There's an out of tree patch that puts these
> > > tasks on an idle CPU (keep in mind energy aware scheduling in the context here)
> > > which seem okay for its purpose. Having a more generic solution in mainline
> > > would be nice.
> > > 
> > > https://lwn.net/Articles/820659/ 
> > 
> > Hello Qais!
> > 
> > Thank you for the link, I like it!
> > 
> > > 
> > > > 
> > > > This leads to a few observations and ideas:
> > > > - Different workloads want different policies. Being able to configure
> > > >   the policy per workload could be useful.
> > > > - A workload that benefits from not being preempted itself could still
> > > >   benefit from preempting (low priority) background system tasks.
> > > 
> > > You can put these tasks as SCHED_IDLE. There's a potential danger of starving
> > > these tasks; but assuming they're background and there's idle time in the
> > > system that should be fine.
> > > 
> > > https://lwn.net/Articles/805317/ 
> > > 
> > > That of course assuming you can classify these background tasks..
> > > 
> > > If you can do the classification, you can also use cpu.shares to reduce how
> > > much cpu time they get. Or CFS bandwidth controller
> > > 
> > > https://lwn.net/Articles/844976/ 
> > 
> > The cfs cgroup controller is that it's getting quite expensive quickly with the
> > increasing depth of the cgroup tree. This is why we had to disable it for some
> > of our primary workloads.
> 
> I can understand that..
> 
> > 
> > Still being able to control latencies on per-cgroup level is one of the goals
> > of this patchset.
> > 
> > > 
> > > I like Androd's model of classifying tasks. I think we need this classification
> > > done by other non-android systems too.
> > > 
> > > > - It would be useful to quickly (and safely) experiment with different
> > > >   policies in production, without having to shut down applications or reboot
> > > >   systems, to determine what the policies for different workloads should be.
> > > 
> > > Userspace should have the knobs that allows them to tune that without reboot.
> > > If you're doing kernel development; then it's part of the job spec I'd say :-)
> > 
> > The problem here occurs because there is no comprehensive way to test any
> > scheduler change rather than run it on many machines (sometimes 1000's) running
> > different production-alike workloads.
> > 
> > If I'm able to test an idea by loading a bpf program (and btw have some sort of
> > safety guarantees: maybe the performance will be hurt, but at least no panics),
> > it can speed up the development process significantly. The alternative is way
> > more complex from the infrastructure's point of view: releasing a custom kernel,
> > test it for safety, reboot certain machines to it, pin the kernel from being
> > automatically updated etc.
> 
> This process is unavoidable IMO. Assuming you have these hooks in; as soon as
> you require a new hook you'll be forced to have a custom kernel with that new
> hook introduced. Which, in my view, no different than pushing a custom kernel
> that forces the function of interest to be noinline. Right?

I think a relatively small and stable set of hooks can cover a large
percentage of potential customization ideas.

> 
> > 
> > > 
> > > I think one can still go with the workflow you suggest for development without
> > > the hooks. You'd need to un-inline the function you're interested in; then you
> > > can use kprobes to hook into it and force an early return. That should produce
> > > the same effect, no?
> > 
> > Basically it's exactly what I'm suggesting. My patchset just provides a
> > convenient way to define these hooks and some basic useful helper functions.
> 
> Convenient will be only true assuming you have a full comprehensive list of
> hooks to never require adding a new one. As I highlighted above, this
> convenience is limited to hooks that you added now.
> 
> Do people always want more hooks? Rhetorical question ;-)

Why do you think that the list of the hooks will be so large/dynamic?

I'm not saying we can figure it out on the first attempt, but I'm pretty sure
that after some initial phase it can be relatively stable, e.g. changing only
with some _major_ changes in the scheduler code.

> 
> > 
> > > 
> > > > - Only a few workloads are large and sensitive enough to merit their own
> > > >   policy tweaks. CFS by itself should be good enough for everything else,
> > > >   and we probably do not want policy tweaks to be a replacement for anything
> > > >   CFS does.
> > > > 
> > > > This leads to BPF hooks, which have been successfully used in various
> > > > kernel subsystems to provide a way for external code to (safely)
> > > > change a few kernel decisions. BPF tooling makes this pretty easy to do,
> > > > and the people deploying BPF scripts are already quite used to updating them
> > > > for new kernel versions.
> > > 
> > > I am (very) wary of these hooks. Scheduler (in mobile at least) is an area that
> > > gets heavily modified by vendors and OEMs. We try very hard to understand the
> > > problems they face and get the right set of solutions in mainline. Which would
> > > ultimately help towards the goal of having a single Generic kernel Image [1]
> > > that gives you what you'd expect out of the platform without any need for
> > > additional cherries on top.
> > 
> > Wouldn't it make your life easier had they provide a set of bpf programs instead
> > of custom patches?
> 
> Not really.
> 
> Having consistent mainline behavior is important, and these customization
> contribute to fragmentation and can throw off userspace developers who find
> they have to do extra work on some platforms to get the desired outcome. They
> will be easy to misuse. We want to see the patches and find ways to improve
> mainline kernel instead.
> 
> That said, I can see the use case of being able to micro-optimize part of the
> scheduler in a workload specific way. But then the way I see this support
> happening (DISCLAIMER, personal opinion :-))
> 
> 	1. The hooks have to be about replacing specific snippet, like Barry's
> 	   example where it's an area that is hard to find a generic solution
> 	   that doesn't have a drawback over a class of workloads.

This makes sense to me, and this is a good topic to discuss: which hooks do we
really need. I don't think it necessarily has to replace something, but I
totally agree on the "hard to find a generic solution" part.

> 
> 	2. The set of bpf programs that modify it live in the kernel tree for
> 	   each hook added. Then we can reason about why the hook is there and
> 	   allow others to reap the benefit. Beside being able to re-evaluate
> 	   easily if the users still need that hook after a potential
> 	   improvement that could render it unnecessary.
> 
> 	3. Out of tree bpf programs can only be loaded if special CONFIG option
> 	   is set so that production kernel can only load known ones that the
> 	   community knows and have reasoned about.
> 
> 	4. Out of tree bpf programs will taint the kernel. A regression
> 	   reported with something funny loaded should be flagged as
> 	   potentially bogus.

2-4 look like generic bpf questions to me; I don't think there is anything
scheduler-specific. So I'd suggest bringing the bpf maintainers into the
discussion; their input can be very valuable.

> 
> IMHO this should tame the beast to something useful to address these situations
> where the change required to improve one workload will harm others and it's
> hard to come up with a good compromise. Then the hook as you suggest could help
> implement that policy specifically for that platform/workload.
> 
> One can note that the behavior I suggest is similar to how modules work :)

The important benefit of bpf is its safety guarantees.

> 
> > 
> > > 
> > > So my worry is that this will open the gate for these hooks to get more than
> > > just micro-optimization done in a platform specific way. And that it will
> > > discourage having the right discussion to fix real problems in the scheduler
> > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > can control how these hooks are used.
> > 
> > I totally understand your worry. I think we need to find a right balance between
> > allowing to implement custom policies and keeping the core functionality
> > working well enough for everybody without a need to tweak anything.
> > 
> > It seems like an alternative to this "let's allow cfs customization via bpf"
> > approach is to completely move the scheduler code into userspace/bpf, something
> > that Google's ghOSt is aiming to do.
> 
> Why not ship a custom kernel instead then?

Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
Just for example, imagine the process of rebooting 1,000,000 machines running
1000s of different workloads, each with their own redundancy and capacity requirements.

This is what makes the ability to push scheduler changes without a reboot/kernel
upgrade so attractive.

Obviously, this isn't the case when we talk about a single kernel engineer and
their laptop/dev server/vm.

> 
> > 
> > > 
> > > The question is: why can't we fix any issues in the scheduler/make it better
> > > and must have these hooks instead?
> > 
> > Of course, if it's possible to implement an idea in a form which is suitable
> > for everybody and upstream it, this is the best outcome. The problem is that
> > not every idea is like that. A bpf program can leverage a priori knowledge
> > of a workload and its needs, something the generic scheduler code lacks
> > by the definition.
> 
> Yep I see your point for certain aspects of the scheduler that are hard to tune
> universally. We just need to be careful not to end up in a wild west or Anything
> Can Happen Thursday situation :-)

Totally agree!

Thanks!
Qais Yousef Oct. 12, 2021, 10:16 a.m. UTC | #6
On 10/11/21 11:09, Roman Gushchin wrote:
> > Convenient will be only true assuming you have a full comprehensive list of
> > hooks to never require adding a new one. As I highlighted above, this
> > convenience is limited to hooks that you added now.
> > 
> > Do people always want more hooks? Rhetorical question ;-)
> 
> Why do you think that the list of the hooks will be so large/dynamic?

It's not a fact. Just my thoughts/guess based on how things usually end up.
It's very likely this will grow. I could be wrong of course :)

> I'm not saying we can figure it out from a first attempt, but I'm pretty sure
> that after some initial phase it can be relatively stable, e.g. changing only
> with some _major_ changes in the scheduler code.

My point was that the speed-up in workflow will be limited by what's
available. It might be enough for a large set of use cases, as you say, but at
some point there will be a new bottleneck that you might think is worth
experimenting with, and the chances that a suitable hook is available are 50:50
in theory. So it's not a magical fix where one would *never* have to push a
custom kernel to all these systems to experiment with some scheduler changes.

> > > > So my worry is that this will open the gate for these hooks to get more than
> > > > just micro-optimization done in a platform specific way. And that it will
> > > > discourage having the right discussion to fix real problems in the scheduler
> > > > because the easy path is to do whatever you want in userspace. I am not sure we
> > > > can control how these hooks are used.
> > > 
> > > I totally understand your worry. I think we need to find a right balance between
> > > allowing to implement custom policies and keeping the core functionality
> > > working well enough for everybody without a need to tweak anything.
> > > 
> > > It seems like an alternative to this "let's allow cfs customization via bpf"
> > > approach is to completely move the scheduler code into userspace/bpf, something
> > > that Google's ghOSt is aiming to do.
> > 
> > Why not ship a custom kernel instead then?
> 
> Shipping a custom kernel (actually any kernel) at this scale isn't easy or fast.
> Just for example, imagine a process of rebooting of a 1000000 machines running
> 1000's different workloads, each with their own redundancy and capacity requirements.
> 
> This what makes an ability to push scheduler changes without a reboot/kernel upgrade
> so attractive.
> 
> Obviously, it's not a case when we talk about a single kernel engineer and their
> laptop/dev server/vm.

I think you're still referring to ghOSt here. I thought your 2 use cases were
different, as you mentioned they "completely move the scheduler code into
userspace/bpf"; but it could be just me misinterpreting what this means. That
didn't read to me like they want to micro-optimize a few certain decisions in
the scheduler, but rather replace it altogether, hence my question.

Anyway. My 2 cents here is that we should be careful not to introduce something
that encourages out-of-tree workarounds for real scheduler problems, nor have it
done in a way where we lose visibility over how these hooks are used and lose
the ability to share them with others who could benefit from the same
micro-optimization too.

Thanks!

--
Qais Yousef
Yafang Shao Nov. 25, 2021, 6 a.m. UTC | #7
Hi Roman,

Scheduler BPF is a great idea.
Thanks for the work.

Scheduler BPF won’t be a small feature, so I think we’d better give a
summary of the possible hooks it may add first.
We must have a *basic rule* to control what it will tend to become, to
avoid adding BPF hooks here and there.
I haven’t found a clear rule yet, but maybe we can learn from
netfilter, which has 5 basic hooks.
Regarding the scheduler BPF hooks, some possible basic hooks may be:
  - Hook for Enqueue
  - Hook for Dequeue
  - Hook for Put Prev Task
  - Hook for Set Next Task


> An example of an userspace part, which loads some simple hooks is available
> here [3]. It's very simple, provided only to simplify playing with the provided
> kernel patches.
>

You’d better add this userspace code into samples/bpf/.


[Some error occurred in my mail client, so I resent it]


--
Thanks
Yafang
Roman Gushchin Nov. 26, 2021, 7:46 p.m. UTC | #8
On Thu, Nov 25, 2021 at 02:00:04PM +0800, Yafang Shao wrote:
> Hi Roman,

Hi Yafang!

> 
> Scheduler BPF is a great idea.
> Thanks for the work.

Thanks!

> 
> Scheduler BPF won’t be a small feature,  I think we’d better give a
> summary of possible hooks it may add first.
> We must have a *basic rule* to control what it will tend to be to
> avoid adding BPF hooks here and there.
> I haven’t found a clear rule yet, but maybe we can learn it from
> netfilter, which has 5 basic hooks.
> Regarding the scheduler BPF hooks, some possible basic hooks may be:
>   - Hook for Enqueue
>   - Hook for Dequeue
>   - Hook for Put Prev Task
>    - Hook for Set Next Task

I think it depends on what we want to achieve. There are several options:
we might aim to implement the whole scheduler logic in bpf, we might aim
to do some adjustments to the existing scheduler behavior or a mix of those
approaches.

Bpf as it is now is not capable enough to implement a new scheduler class
without a substantial amount of new C code (in the form of helpers, maybe
custom maps, some verifier changes, etc). In particular, it's challenging to
provide strong safety guarantees: any scheduler bpf program loaded shouldn't
crash or deadlock the system (otherwise bpf isn't any better than a kernel
module). Also, performance margins are quite tight.

I'm not saying that providing such generic hooks is impossible or useless,
but it requires a lot of changes and support code and I'm not sure that we have
a good justification for them right now.

I think instead we might want to see bpf hooks as a better form of (sysctl)
tunables, which are more flexible (e.g. can be used for specific processes,
cgroups, cpus, being enabled depending on load, weather, etc) and do not create
an ABI (so are easier to maintain).
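
For example, a "tunable" that suppresses wakeup preemption only for one cgroup
could look roughly like the sketch below. The hook name, its signature and the
return-value convention are assumptions; bpf_sched_entity_belongs_to_cgrp() is
one of the helpers added by this series, and the cgroup id would be filled in
by the loader:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  const volatile __u64 protected_cgrp_id; /* set by userspace before load */

  SEC("sched/cfs_wakeup_preempt_entity")
  int BPF_PROG(protect_cgroup, struct sched_entity *curr, struct sched_entity *se)
  {
          if (protected_cgrp_id &&
              bpf_sched_entity_belongs_to_cgrp(curr, protected_cgrp_id))
                  return -1; /* don't preempt entities of this cgroup */
          return 0;          /* otherwise keep the default decision */
  }

  char LICENSE[] SEC("license") = "GPL";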

> 
> 
> > An example of an userspace part, which loads some simple hooks is available
> > here [3]. It's very simple, provided only to simplify playing with the provided
> > kernel patches.
> >
> 
> You’d better add this userspace code into samples/bpf/.

I thought samples/bpf was considered deprecated (in favor of selftests/bpf/),
but I'm gonna check with the bpf maintainers. Thanks for the idea!
Huichun Feng Jan. 15, 2022, 8:29 a.m. UTC | #9
Hi Roman and the list,

I have a naive question regarding BPF hooks for sched.

Given that BPF can also be attached to tracepoints, why do we add a BPF prog
type specific to sched?

The reason I can come up with is that sched BPF can have a retval that drives
the scheduling decision (behind a static branch), whereas a tracepoint is not
able to do this. Is it mainly because of this, or anything else?


Thanks
Huichun
Roman Gushchin Jan. 18, 2022, 10:54 p.m. UTC | #10
On Sat, Jan 15, 2022 at 04:29:24PM +0800, Huichun Feng wrote:
> Hi Roman and the list,

Hello Huichun!

> 
> I have a naive question regarding BPF hook for sched.
> 
> Given that BPF can also be attached to tracepoint, why do we add a BPF prog
> type specific to sched?

Tracing programs can have return values as well, see kretprobes.

> 
> The reason I can come up with is that sched BPF can have retval to drive the
> scheduling decision in static branch, whereas tracepoint is not able to do this.
> Is it mainly because of this or anything else?

Well, you are right that right now there is no strict necessity to
introduce a new prog type (aside from the static branch mechanism you
mentioned), however I believe it's useful in the long run. Sched
programs might be able to use a different set of helpers, maybe there
will be some additional restrictions, etc. It's an RFC version of the
patchset and any ideas, suggestions and criticism are highly welcome!
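
For illustration, a hook call site gated by the static branch looks roughly
like the fragment below (an illustrative sketch, not the exact code from the
patches; the helper names are assumptions):

  /* Somewhere like check_preempt_tick() in kernel/sched/fair.c */
  if (bpf_sched_enabled()) { /* static-branch-backed check */
          int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);

          if (ret < 0)
                  return;    /* bpf vetoed the preemption */
          if (ret > 0) {
                  resched_curr(rq_of(cfs_rq));
                  return;    /* bpf forced the preemption */
          }
          /* ret == 0: fall through to the default CFS logic */
  }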

Thanks!
Ren Zhijie July 19, 2022, 1:17 p.m. UTC | #12
Hi Roman and list,

We want to implement a programmable scheduler to meet the scheduling
requirements of different workloads.

Using BPF, we can easily deploy scheduling policies for specific
workloads and quickly verify them without modifying the kernel code.
This greatly reduces the cost of deploying new scheduling policies in
the production environment.

Therefore, we want to continue to develop based on your patch. We plan 
to merge it into the openeuler open-source community and use the 
community to continuously evolve and maintain it.
(link: https://www.openeuler.org/en/)

We made some changes to your patch:
1. Adapt it to the openeuler-OLK-5.10 branch, which is mostly based on
the linux longterm branch 5.10.
2. Introduce the Kconfig option CONFIG_BPF_SCHED to isolate the related
code at compile time.
3. The helpers bpf_sched_entity_to_cgrpid() and
bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group
to which the sched entity belongs through se->my_q->tg->css.cgroup.

We have some ideas for the next iteration of Scheduler BPF that we would
like to share with you:
1. A tag field is added to struct task_struct and struct task_group.
Users can use a file system interface to mark different tags for
specific workloads, and the bpf prog reads the tags to detect different
workloads (see the sketch below).
2. Add BPF hooks and helpers to scheduling paths such as select_task_rq
and pick_next_task to enable scalability.
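
A rough sketch of how a hook could use such a tag; everything here is
hypothetical (the tag value, the bpf_sched_task_tag() helper and the hook name
do not exist yet):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define TAG_LATENCY_SENSITIVE 1 /* hypothetical tag value */

  SEC("sched/cfs_check_preempt_wakeup")
  int BPF_PROG(tag_based_policy, struct task_struct *curr, struct task_struct *p)
  {
          /* bpf_sched_task_tag() is a hypothetical helper returning the
           * proposed task tag; it does not exist in this patchset. */
          if (bpf_sched_task_tag(p) == TAG_LATENCY_SENSITIVE)
                  return 1; /* let tagged tasks preempt immediately */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";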

It's a new attempt, and there are bound to be a lot of problems later, but
it's exciting that it makes the scheduler programmable.

cheers,
Ren Zhijie
Roman Gushchin July 19, 2022, 11:21 p.m. UTC | #13
On Tue, Jul 19, 2022 at 09:17:24PM +0800, Ren Zhijie wrote:
> Hi Roman and list,
> 
> We want to implement a programmable scheduler to meet the schedule
> requirements of different workloads.
> 
> Using BPF, we can easily deploy schedule policies for specific workloads,
> quickly verifying without modifying the kernel code. This greatly reduces
> the cost of deploying new schedule policies in the production environment.
> 
> Therefore, we want to continue to develop based on your patch. We plan to
> merge it into the openeuler open-source community and use the community to
> continuously evolve and maintain it.
> (link: https://www.openeuler.org/en/)
> 
> We made some changes to your patch:
> 1. Adapt to the openeuler-OLK-5.10 branch, which mostly base on linux
> longterm branch 5.10.
> 2. Introduce the Kconfig CONFIG_BPF_SCHED to isolate related code at compile
> time.
> 3. helpers bpf_sched_entity_to_cgrpid() and
> bpf_sched_entity_belongs_to_cgrp() are modified to obtain the task group to
> which the sched entity belongs through se->my_q->tg->css.cgroup.
> 
> We have some ideas for the next iteration of Scheduler BPF that we would
> like to share with you:
> 1.The tag field is added to struct task_struct and struct task_group. Users
> can use the file system interface to mark different tags for specific
> workloads. The bpf prog obtains the tags to detect different workloads.
> 2.Add BPF hook and helper to scheduling processes such as select_task_rq and
> pick_next_task to enable scalability.
> 
> It's a new attempt, and there's bound to be a lot of problems later, but
> it's exciting that it makes the schduler programmable.

Hi Ren!

Great to hear my work is useful and thank you for describing your plans!
I'm not actively working on it right now, but I might start again in the future.
Let me know if I can help you with this effort.

Thanks!