diff mbox series

[RFC,v2,1/5] mm, oom: Introduce bpf_oom_evaluate_task

Message ID 20230810081319.65668-2-zhouchuyi@bytedance.com (mailing list archive)
State RFC
Headers show
Series mm: Select victim using bpf_oom_evaluate_task | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat
netdev/tree_selection success Not a local patch, async

Commit Message

Chuyi Zhou Aug. 10, 2023, 8:13 a.m. UTC
This patch adds a new hook bpf_oom_evaluate_task in oom_evaluate_task. It
takes oc and the currently iterated task as parameters and returns a result
indicating whether that task should be selected. We can use it to bypass the
current logic of oom_evaluate_task and implement customized OOM policies
in the attached BPF programs.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 mm/oom_kill.c | 59 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 9 deletions(-)
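To make the hook's contract concrete, here is a minimal userspace model of the dispatch the patch adds. The enum values mirror the ones introduced in the patch; the policy function stands in for an attached BPF program and is entirely hypothetical:

```c
#include <assert.h>

/* Mirrors the enum the patch adds to mm/oom_kill.c. */
enum { NO_BPF_POLICY, BPF_EVAL_ABORT, BPF_EVAL_NEXT, BPF_EVAL_SELECT };

/* Hypothetical stand-in for an attached BPF program: protect pid 1,
 * always pick pid 42, and express no opinion otherwise. */
static int policy(int pid)
{
	if (pid == 1)
		return BPF_EVAL_NEXT;
	if (pid == 42)
		return BPF_EVAL_SELECT;
	return NO_BPF_POLICY;
}

/* Models the scan in select_bad_process()/oom_evaluate_task():
 * returns the chosen pid, or -1 if the policy aborted or never chose
 * (in the kernel the default oom_badness() scoring would then apply). */
static int scan(const int *pids, int n)
{
	int i, chosen = -1;

	for (i = 0; i < n; i++) {
		switch (policy(pids[i])) {
		case BPF_EVAL_ABORT:
			return -1;	/* abort the search */
		case BPF_EVAL_NEXT:
			continue;	/* ignore this task */
		case BPF_EVAL_SELECT:
			chosen = pids[i];
			continue;	/* remember it, keep scanning */
		default:
			continue;	/* NO_BPF_POLICY: default logic */
		}
	}
	return chosen;
}
```

This only models the control flow of the switch statement in the patch, not the locking or refcounting the real iteration does.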

Comments

Alexei Starovoitov Aug. 17, 2023, 2:07 a.m. UTC | #1
On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>  static int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
>         struct oom_control *oc = arg;
> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>         if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>                 goto next;
>
> +       /*
> +        * If task is allocating a lot of memory and has been marked to be
> +        * killed first if it triggers an oom, then select it.
> +        */
> +       if (oom_task_origin(task)) {
> +               points = LONG_MAX;
> +               goto select;
> +       }
> +
> +       switch (bpf_oom_evaluate_task(task, oc)) {
> +       case BPF_EVAL_ABORT:
> +               goto abort; /* abort search process */
> +       case BPF_EVAL_NEXT:
> +               goto next; /* ignore the task */
> +       case BPF_EVAL_SELECT:
> +               goto select; /* select the task */
> +       default:
> +               break; /* No BPF policy */
> +       }
> +

I think forcing bpf prog to look at every task is going to be limiting
long term.
It's more flexible to invoke bpf prog from out_of_memory()
and if it doesn't choose a task then fallback to select_bad_process().
I believe that's what Roman was proposing.
bpf can choose to iterate memcg or it might have some side knowledge
that there are processes that can be set as oc->chosen right away,
so it can skip the iteration.
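A rough userspace sketch of the flow proposed here: try the BPF policy first and fall back to the default scan only if it declines. All function names are invented models, not kernel APIs:

```c
#include <assert.h>

/* Hypothetical BPF policy hook: returns a chosen pid, or 0 if the
 * program declines to choose. */
static int bpf_choose_victim(void)
{
	return 1234;	/* side knowledge: set oc->chosen right away */
}

/* Models the default kernel scan over all eligible tasks. */
static int select_bad_process_model(void)
{
	return 1;	/* pretend the badness scan chose pid 1 */
}

/* Models out_of_memory(): invoke the BPF policy first; fall back to
 * the default victim selection only when it declines. */
static int out_of_memory_model(void)
{
	int chosen = bpf_choose_victim();

	if (chosen)
		return chosen;	/* policy chose, skip the iteration */
	return select_bad_process_model();
}
```

The point of this shape is that the program is free to do its own (possibly cheaper) search instead of being called once per task.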
Chuyi Zhou Aug. 17, 2023, 2:51 a.m. UTC | #2
Hello,

On 2023/8/17 10:07, Alexei Starovoitov wrote:
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

IIUC, we may need some new bpf features if we want to iterate
tasks/memcgs in BPF, such as:
bpf_for_each_task
bpf_for_each_memcg
bpf_for_each_task_in_memcg
...

It seems we have some work to do first on the BPF side.
Would these iterating features be useful in other BPF scenarios besides
OOM policy?

Thanks.
Alexei Starovoitov Aug. 17, 2023, 3:22 a.m. UTC | #3
On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>
> Hello,
>
> IIUC, We may need some new bpf features if we want to iterating
> tasks/memcg in BPF, sush as:
> bpf_for_each_task
> bpf_for_each_memcg
> bpf_for_each_task_in_memcg
> ...
>
> It seems we have some work to do first in the BPF side.
> Will these iterating features be useful in other BPF scenario except OOM
> Policy?

Yes.
Use open coded iterators though, like the example in
https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/

bpf_for_each(task_vma, vma, task, 0) { ... }
will safely iterate the vma-s of the task.
Similarly, struct css_task_iter can be hidden inside a bpf open coded iterator.
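As a userspace analogy, an open coded iterator is just new/next/destroy calls around hidden state, which is how a css_task_iter could be wrapped. The types and names below are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Invented iterator state; a bpf css_task iterator would wrap a
 * struct css_task_iter behind the same new/next/destroy shape. */
struct task_iter {
	const int *tasks;
	int n, pos;
};

static void task_iter_new(struct task_iter *it, const int *tasks, int n)
{
	it->tasks = tasks;
	it->n = n;
	it->pos = 0;
}

static const int *task_iter_next(struct task_iter *it)
{
	if (it->pos >= it->n)
		return NULL;
	return &it->tasks[it->pos++];
}

static void task_iter_destroy(struct task_iter *it)
{
	it->tasks = NULL;	/* nothing to release in this model */
}

/* Visit tasks the way a bpf_for_each() loop would expand. */
static int sum_tasks(const int *tasks, int n)
{
	struct task_iter it;
	const int *t;
	int sum = 0;

	task_iter_new(&it, tasks, n);
	while ((t = task_iter_next(&it)))
		sum += *t;
	task_iter_destroy(&it);
	return sum;
}
```

The verifier's job (not modeled here) is to guarantee destroy is always reached and next is never called on destroyed state.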
Chuyi Zhou Aug. 18, 2023, 3:30 a.m. UTC | #4
Hello,
在 2023/8/17 11:22, Alexei Starovoitov 写道:
> On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>> IIUC, We may need some new bpf features if we want to iterating
>> tasks/memcg in BPF, sush as:
>> bpf_for_each_task
>> bpf_for_each_memcg
>> bpf_for_each_task_in_memcg
>> ...
>>
>> It seems we have some work to do first in the BPF side.
>> Will these iterating features be useful in other BPF scenario except OOM
>> Policy?
> 
> Yes.
> Use open coded iterators though.
> Like example in
> https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/
> 
> bpf_for_each(task_vma, vma, task, 0) { ... }
> will safely iterate vma-s of the task.
> Similarly struct css_task_iter can be hidden inside bpf open coded iterator.
OK. I think the following APIs would be useful, and I am willing to
start with these in another bpf-next RFC patchset:

1. bpf_for_each(task). Just like for_each_process(p) in the kernel, to
iterate all tasks in the system under rcu_read_lock().

2. bpf_for_each(css_task, task, css). It works like
css_task_iter_{start, next, end} and would be used to iterate
tasks/threads under a css.

3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works
like css_next_descendant_{pre, post} to iterate all descendants.

If you have better ideas or any advice, please let me know.
Thanks.
Alexei Starovoitov Aug. 18, 2023, 4:34 a.m. UTC | #5
On Thu, Aug 17, 2023 at 8:30 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> OK. I think the following APIs whould be useful and I am willing to
> start with these in another bpf-next RFC patchset:
>
> 1. bpf_for_each(task). Just like for_each_process(p) in kernel to
> itearing all tasks in the system with rcu_read_lock().
>
> 2. bpf_for_each(css_task, task, css). It works like
> css_task_iter_{start, next, end} and would be used to iterating
> tasks/threads under a css.
>
> 3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works
> like css_next_descendant_{pre, post} to iterating all descendant.
>
> If you have better ideas or any advice, please let me know.

Sounds great. These three new iterators are unrelated to the oom discussion
and can be developed/landed in parallel.
They will be useful in other bpf programs.
Michal Hocko Aug. 22, 2023, 10:39 a.m. UTC | #6
On Wed 16-08-23 19:07:10, Alexei Starovoitov wrote:
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

This is certainly possible but I am worried this will lead to a lot of
duplication. There are common tasks that all/most oom victim selection
implementations should do. First of all, they should make sure that the
victim belongs to the oom domain. Arguably, they should also be aware of
an ongoing oom victim tear down to prevent over-killing, and handle oom
victim reference counting properly. Most people are not even aware of
those things. Do we really want all those to be re-invented - most
likely incorrectly?

The advantage of reusing oom_evaluate_task is that all of that can be
avoided. Iterating over tasks with a pre-defined oom victim sure sounds
like unnecessary wasted CPU cycles, but if you want to prevent
over-killing it is still necessary. As the oom killer can be invoked
really rapidly (before the victim has a chance to die), I believe this
is a very useful feature.
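A toy model of the over-kill guard being described: if a previously chosen victim is still tearing down, the search aborts rather than picking a second victim. In the kernel this involves tsk_is_oom_victim() and MMF_OOM_SKIP; the struct below is invented:

```c
#include <assert.h>

struct task_model {
	int pid;
	int oom_victim;	/* already selected by a previous oom kill */
	int oom_skip;	/* teardown done, ok to look at other tasks */
};

/* Returns the pid to kill, or -1 to abort because an existing victim
 * is still exiting (killing another task now would be an over-kill). */
static int pick_victim(const struct task_model *ts, int n)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if (ts[i].oom_victim && !ts[i].oom_skip)
			return -1;	/* abort: let the victim die first */
		best = ts[i].pid;	/* placeholder for badness scoring */
	}
	return best;
}
```

This check is exactly the kind of common logic every custom policy would have to reimplement if it bypassed oom_evaluate_task entirely.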
Bixuan Cui Sept. 13, 2023, 1:18 a.m. UTC | #7
On 2023/8/10 16:13, Chuyi Zhou wrote:
> +#include <linux/bpf.h>
>  #include <linux/oom.h>
>  #include <linux/mm.h>
>  #include <linux/err.h>
> @@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	return CONSTRAINT_NONE;
>  }
> 
> +enum {
> +	NO_BPF_POLICY,
> +	BPF_EVAL_ABORT,
> +	BPF_EVAL_NEXT,
> +	BPF_EVAL_SELECT,
> +};
> +

I saw that tools/testing/selftests/bpf/progs/oom_policy.c also uses
NO_BPF_POLICY etc. I think the
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
definitions can be placed in include/linux/oom.h

Thanks
Bixuan Cui
Chuyi Zhou Sept. 13, 2023, 8 a.m. UTC | #8
Hello, Bixuan.

On 2023/9/13 09:18, Bixuan Cui wrote:
> 
> 
> On 2023/8/10 16:13, Chuyi Zhou wrote:
>> +#include <linux/bpf.h>
>>  #include <linux/oom.h>
>>  #include <linux/mm.h>
>>  #include <linux/err.h>
>> @@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>>  	return CONSTRAINT_NONE;
>>  }
>> 
>> +enum {
>> +	NO_BPF_POLICY,
>> +	BPF_EVAL_ABORT,
>> +	BPF_EVAL_NEXT,
>> +	BPF_EVAL_SELECT,
>> +};
>> +
> 
> I saw that tools/testing/selftests/bpf/progs/oom_policy.c is also using 
> NO_BPF_POLICY etc. I think
> +enum {
> +    NO_BPF_POLICY,
> +    BPF_EVAL_ABORT,
> +    BPF_EVAL_NEXT,
> +    BPF_EVAL_SELECT,
> +};
> +
> definitions can be placed in include/linux/oom.h
> 

Thanks for your feedback!

Yes, maybe we should move these enums to a more appropriate place so
that they are generated into BTF and we can take them from vmlinux.h.

> Thanks
> Bixuan Cui
Bixuan Cui Sept. 13, 2023, 11:24 a.m. UTC | #9
On 2023/8/10 16:13, Chuyi Zhou wrote:
> +
> +__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
> +{
> +	return NO_BPF_POLICY;
> +}
> +
> +BTF_SET8_START(oom_bpf_fmodret_ids)
> +BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
> +BTF_SET8_END(oom_bpf_fmodret_ids)

I have a question here: why use __weak? Are there other modules that can
redefine bpf_oom_evaluate_task? Why not use __bpf_kfunc
(Documentation/bpf/kfuncs.rst)?

Thanks
Bixuan Cui
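On the __weak question, a userspace sketch of the weak-default idea. Note that in the kernel the override does not happen at link time: a BPF fmod_ret program modifies the stub's return value at attach time, which is why the stub also needs noinline (so a real call site survives to attach to). The stub name below is invented:

```c
#include <assert.h>

/* Userspace sketch: __weak supplies the "no policy attached" default.
 * A stronger definition in another translation unit (or, in the kernel
 * case, an attached fmod_ret program) can change what callers observe. */
__attribute__((weak)) int bpf_oom_evaluate_task_stub(void)
{
	return 0;	/* NO_BPF_POLICY: fall back to default scoring */
}
```

With no override present, callers simply see the default return value and the built-in selection logic runs.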

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..255c9ef1d808 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@ 
  *  kernel subsystems and hints as to where to find out what things do.
  */
 
+#include <linux/bpf.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -305,6 +306,27 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
+__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
+{
+	return NO_BPF_POLICY;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+	.owner = THIS_MODULE,
+	.set   = &oom_bpf_fmodret_ids,
+};
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
@@ -317,6 +339,26 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
 		goto next;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+		goto abort; /* abort search process */
+	case BPF_EVAL_NEXT:
+		goto next; /* ignore the task */
+	case BPF_EVAL_SELECT:
+		goto select; /* select the task */
+	default:
+		break; /* No BPF policy */
+	}
+
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -329,15 +371,6 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, oc->totalpages);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
@@ -732,10 +765,18 @@  static struct ctl_table vm_oom_kill_table[] = {
 
 static int __init oom_init(void)
 {
+	int err;
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", vm_oom_kill_table);
 #endif
+
+#ifdef CONFIG_BPF_SYSCALL
+	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+	if (err)
+		pr_warn("error while registering oom fmodret entrypoints: %d", err);
+#endif
+
 	return 0;
 }
 subsys_initcall(oom_init)