diff mbox series

[RFC,v2,1/5] mm, oom: Introduce bpf_oom_evaluate_task

Message ID 20230810081319.65668-2-zhouchuyi@bytedance.com (mailing list archive)
State RFC
Headers show
Series mm: Select victim using bpf_oom_evaluate_task | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat
netdev/tree_selection success Not a local patch, async

Commit Message

Chuyi Zhou Aug. 10, 2023, 8:13 a.m. UTC
This patch adds a new hook bpf_oom_evaluate_task in oom_evaluate_task. It
takes oc and the currently iterated task as parameters and returns a result
indicating whether that task should be selected. We can use it to bypass the
current logic of oom_evaluate_task and implement customized OOM policies
in the attached BPF programs.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 mm/oom_kill.c | 59 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 9 deletions(-)
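To make the hook's contract concrete, here is a minimal userspace model of the dispatch the patch adds. The enum values mirror the ones introduced in the patch; the policy function stands in for an attached BPF program and is entirely hypothetical:

```c
#include <assert.h>

/* Mirrors the enum the patch adds to mm/oom_kill.c. */
enum { NO_BPF_POLICY, BPF_EVAL_ABORT, BPF_EVAL_NEXT, BPF_EVAL_SELECT };

/* Hypothetical stand-in for an attached BPF program: protect pid 1,
 * always pick pid 42, and express no opinion otherwise. */
static int policy(int pid)
{
	if (pid == 1)
		return BPF_EVAL_NEXT;
	if (pid == 42)
		return BPF_EVAL_SELECT;
	return NO_BPF_POLICY;
}

/* Models the scan in select_bad_process()/oom_evaluate_task():
 * returns the chosen pid, or -1 if the policy aborted or never chose
 * (in the kernel the default oom_badness() scoring would then apply). */
static int scan(const int *pids, int n)
{
	int i, chosen = -1;

	for (i = 0; i < n; i++) {
		switch (policy(pids[i])) {
		case BPF_EVAL_ABORT:
			return -1;	/* abort the search */
		case BPF_EVAL_NEXT:
			continue;	/* ignore this task */
		case BPF_EVAL_SELECT:
			chosen = pids[i];
			continue;	/* remember it, keep scanning */
		default:
			continue;	/* NO_BPF_POLICY: default logic */
		}
	}
	return chosen;
}
```

This only models the control flow of the switch statement in the patch, not the locking or refcounting the real iteration does.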

Comments

Alexei Starovoitov Aug. 17, 2023, 2:07 a.m. UTC | #1
On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>  static int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
>         struct oom_control *oc = arg;
> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>         if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>                 goto next;
>
> +       /*
> +        * If task is allocating a lot of memory and has been marked to be
> +        * killed first if it triggers an oom, then select it.
> +        */
> +       if (oom_task_origin(task)) {
> +               points = LONG_MAX;
> +               goto select;
> +       }
> +
> +       switch (bpf_oom_evaluate_task(task, oc)) {
> +       case BPF_EVAL_ABORT:
> +               goto abort; /* abort search process */
> +       case BPF_EVAL_NEXT:
> +               goto next; /* ignore the task */
> +       case BPF_EVAL_SELECT:
> +               goto select; /* select the task */
> +       default:
> +               break; /* No BPF policy */
> +       }
> +

I think forcing bpf prog to look at every task is going to be limiting
long term.
It's more flexible to invoke bpf prog from out_of_memory()
and if it doesn't choose a task then fallback to select_bad_process().
I believe that's what Roman was proposing.
bpf can choose to iterate memcg or it might have some side knowledge
that there are processes that can be set as oc->chosen right away,
so it can skip the iteration.
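A rough userspace sketch of the flow proposed here: try the BPF policy first and fall back to the default scan only if it declines. All function names are invented models, not kernel APIs:

```c
#include <assert.h>

/* Hypothetical BPF policy hook: returns a chosen pid, or 0 if the
 * program declines to choose. */
static int bpf_choose_victim(void)
{
	return 1234;	/* side knowledge: set oc->chosen right away */
}

/* Models the default kernel scan over all eligible tasks. */
static int select_bad_process_model(void)
{
	return 1;	/* pretend the badness scan chose pid 1 */
}

/* Models out_of_memory(): invoke the BPF policy first; fall back to
 * the default victim selection only when it declines. */
static int out_of_memory_model(void)
{
	int chosen = bpf_choose_victim();

	if (chosen)
		return chosen;	/* policy chose, skip the iteration */
	return select_bad_process_model();
}
```

The point of this shape is that the program is free to do its own (possibly cheaper) search instead of being called once per task.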
Chuyi Zhou Aug. 17, 2023, 2:51 a.m. UTC | #2
Hello,

On 2023/8/17 10:07, Alexei Starovoitov wrote:
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

IIUC, we may need some new bpf features if we want to iterate
tasks/memcgs in BPF, such as:
bpf_for_each_task
bpf_for_each_memcg
bpf_for_each_task_in_memcg
...

It seems we have some work to do first on the BPF side.
Would these iterating features be useful in other BPF scenarios besides
OOM policy?

Thanks.
Alexei Starovoitov Aug. 17, 2023, 3:22 a.m. UTC | #3
On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>
> Hello,
>
> IIUC, We may need some new bpf features if we want to iterating
> tasks/memcg in BPF, sush as:
> bpf_for_each_task
> bpf_for_each_memcg
> bpf_for_each_task_in_memcg
> ...
>
> It seems we have some work to do first in the BPF side.
> Will these iterating features be useful in other BPF scenario except OOM
> Policy?

Yes.
Use open coded iterators though, like the example in
https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/

bpf_for_each(task_vma, vma, task, 0) { ... }
will safely iterate the vma-s of the task.
Similarly, struct css_task_iter can be hidden inside a bpf open coded iterator.
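As a userspace analogy, an open coded iterator is just new/next/destroy calls around hidden state, which is how a css_task_iter could be wrapped. The types and names below are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Invented iterator state; a bpf css_task iterator would wrap a
 * struct css_task_iter behind the same new/next/destroy shape. */
struct task_iter {
	const int *tasks;
	int n, pos;
};

static void task_iter_new(struct task_iter *it, const int *tasks, int n)
{
	it->tasks = tasks;
	it->n = n;
	it->pos = 0;
}

static const int *task_iter_next(struct task_iter *it)
{
	if (it->pos >= it->n)
		return NULL;
	return &it->tasks[it->pos++];
}

static void task_iter_destroy(struct task_iter *it)
{
	it->tasks = NULL;	/* nothing to release in this model */
}

/* Visit tasks the way a bpf_for_each() loop would expand. */
static int sum_tasks(const int *tasks, int n)
{
	struct task_iter it;
	const int *t;
	int sum = 0;

	task_iter_new(&it, tasks, n);
	while ((t = task_iter_next(&it)))
		sum += *t;
	task_iter_destroy(&it);
	return sum;
}
```

The verifier's job (not modeled here) is to guarantee destroy is always reached and next is never called on destroyed state.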
Chuyi Zhou Aug. 18, 2023, 3:30 a.m. UTC | #4
Hello,
在 2023/8/17 11:22, Alexei Starovoitov 写道:
> On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>> IIUC, We may need some new bpf features if we want to iterating
>> tasks/memcg in BPF, sush as:
>> bpf_for_each_task
>> bpf_for_each_memcg
>> bpf_for_each_task_in_memcg
>> ...
>>
>> It seems we have some work to do first in the BPF side.
>> Will these iterating features be useful in other BPF scenario except OOM
>> Policy?
> 
> Yes.
> Use open coded iterators though.
> Like example in
> https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/
> 
> bpf_for_each(task_vma, vma, task, 0) { ... }
> will safely iterate vma-s of the task.
> Similarly struct css_task_iter can be hidden inside bpf open coded iterator.
OK. I think the following APIs would be useful, and I am willing to
start with these in another bpf-next RFC patchset:

1. bpf_for_each(task). Just like for_each_process(p) in the kernel, to
iterate all tasks in the system under rcu_read_lock().

2. bpf_for_each(css_task, task, css). It works like
css_task_iter_{start, next, end} and would be used to iterate
tasks/threads under a css.

3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works
like css_next_descendant_{pre, post} to iterate all descendants.

If you have better ideas or any advice, please let me know.
Thanks.
Alexei Starovoitov Aug. 18, 2023, 4:34 a.m. UTC | #5
On Thu, Aug 17, 2023 at 8:30 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> OK. I think the following APIs whould be useful and I am willing to
> start with these in another bpf-next RFC patchset:
>
> 1. bpf_for_each(task). Just like for_each_process(p) in kernel to
> itearing all tasks in the system with rcu_read_lock().
>
> 2. bpf_for_each(css_task, task, css). It works like
> css_task_iter_{start, next, end} and would be used to iterating
> tasks/threads under a css.
>
> 3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works
> like css_next_descendant_{pre, post} to iterating all descendant.
>
> If you have better ideas or any advice, please let me know.

Sounds great. These three new iterators are unrelated to the oom discussion
and can be developed/landed in parallel.
They will be useful in other bpf programs.
Michal Hocko Aug. 22, 2023, 10:39 a.m. UTC | #6
On Wed 16-08-23 19:07:10, Alexei Starovoitov wrote:
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

This is certainly possible but I am worried this will lead to a lot of
duplication. There are common tasks that all/most oom victim selection
implementations should do. First of all, they should make sure that the
victim belongs to the oom domain. Arguably, they should also be aware of
an ongoing oom victim tear down to prevent over-killing, and handle oom
victim reference counting properly. Most people are not even aware of
those things. Do we really want all those to be re-invented - most
likely incorrectly?

The advantage of reusing oom_evaluate_task is that all of that can be
avoided. Iterating over tasks with a pre-defined oom victim sure sounds
like unnecessary wasted CPU cycles, but if you want to prevent
over-killing it is still necessary. As the oom killer can be invoked
really rapidly (before the victim has a chance to die), I believe this
is a very useful feature.
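A toy model of the over-kill guard being described: if a previously chosen victim is still tearing down, the search aborts rather than picking a second victim. In the kernel this involves tsk_is_oom_victim() and MMF_OOM_SKIP; the struct below is invented:

```c
#include <assert.h>

struct task_model {
	int pid;
	int oom_victim;	/* already selected by a previous oom kill */
	int oom_skip;	/* teardown done, ok to look at other tasks */
};

/* Returns the pid to kill, or -1 to abort because an existing victim
 * is still exiting (killing another task now would be an over-kill). */
static int pick_victim(const struct task_model *ts, int n)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if (ts[i].oom_victim && !ts[i].oom_skip)
			return -1;	/* abort: let the victim die first */
		best = ts[i].pid;	/* placeholder for badness scoring */
	}
	return best;
}
```

This check is exactly the kind of common logic every custom policy would have to reimplement if it bypassed oom_evaluate_task entirely.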
Bixuan Cui Sept. 13, 2023, 1:18 a.m. UTC | #7
On 2023/8/10 16:13, Chuyi Zhou wrote:
> +#include <linux/bpf.h>
>  #include <linux/oom.h>
>  #include <linux/mm.h>
>  #include <linux/err.h>
> @@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	return CONSTRAINT_NONE;
>  }
> 
> +enum {
> +	NO_BPF_POLICY,
> +	BPF_EVAL_ABORT,
> +	BPF_EVAL_NEXT,
> +	BPF_EVAL_SELECT,
> +};
> +

I saw that tools/testing/selftests/bpf/progs/oom_policy.c also uses
NO_BPF_POLICY etc. I think the
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
definitions can be placed in include/linux/oom.h

Thanks
Bixuan Cui
Chuyi Zhou Sept. 13, 2023, 8 a.m. UTC | #8
Hello, Bixuan.

On 2023/9/13 09:18, Bixuan Cui wrote:
> 
> 
> On 2023/8/10 16:13, Chuyi Zhou wrote:
>> +#include <linux/bpf.h>
>>  #include <linux/oom.h>
>>  #include <linux/mm.h>
>>  #include <linux/err.h>
>> @@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>>  	return CONSTRAINT_NONE;
>>  }
>> 
>> +enum {
>> +	NO_BPF_POLICY,
>> +	BPF_EVAL_ABORT,
>> +	BPF_EVAL_NEXT,
>> +	BPF_EVAL_SELECT,
>> +};
>> +
> 
> I saw that tools/testing/selftests/bpf/progs/oom_policy.c is also using 
> NO_BPF_POLICY etc. I think
> +enum {
> +    NO_BPF_POLICY,
> +    BPF_EVAL_ABORT,
> +    BPF_EVAL_NEXT,
> +    BPF_EVAL_SELECT,
> +};
> +
> definitions can be placed in include/linux/oom.h
> 

Thanks for your feedback!

Yes, maybe we should move these enums to a more appropriate place so
that they are generated into BTF and we can take them from vmlinux.h.

> Thanks
> Bixuan Cui
Bixuan Cui Sept. 13, 2023, 11:24 a.m. UTC | #9
On 2023/8/10 16:13, Chuyi Zhou wrote:
> +
> +__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
> +{
> +	return NO_BPF_POLICY;
> +}
> +
> +BTF_SET8_START(oom_bpf_fmodret_ids)
> +BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
> +BTF_SET8_END(oom_bpf_fmodret_ids)

I have a question here: why use __weak? Are there other modules that can
redefine bpf_oom_evaluate_task? Why not use __bpf_kfunc
(Documentation/bpf/kfuncs.rst)?

Thanks
Bixuan Cui
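On the __weak question, a userspace sketch of the weak-default idea. Note that in the kernel the override does not happen at link time: a BPF fmod_ret program modifies the stub's return value at attach time, which is why the stub also needs noinline (so a real call site survives to attach to). The stub name below is invented:

```c
#include <assert.h>

/* Userspace sketch: __weak supplies the "no policy attached" default.
 * A stronger definition in another translation unit (or, in the kernel
 * case, an attached fmod_ret program) can change what callers observe. */
__attribute__((weak)) int bpf_oom_evaluate_task_stub(void)
{
	return 0;	/* NO_BPF_POLICY: fall back to default scoring */
}
```

With no override present, callers simply see the default return value and the built-in selection logic runs.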

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..255c9ef1d808 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@ 
  *  kernel subsystems and hints as to where to find out what things do.
  */
 
+#include <linux/bpf.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -305,6 +306,27 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
+__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
+{
+	return NO_BPF_POLICY;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+	.owner = THIS_MODULE,
+	.set   = &oom_bpf_fmodret_ids,
+};
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
@@ -317,6 +339,26 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
 		goto next;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+		goto abort; /* abort search process */
+	case BPF_EVAL_NEXT:
+		goto next; /* ignore the task */
+	case BPF_EVAL_SELECT:
+		goto select; /* select the task */
+	default:
+		break; /* No BPF policy */
+	}
+
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -329,15 +371,6 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, oc->totalpages);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
@@ -732,10 +765,18 @@  static struct ctl_table vm_oom_kill_table[] = {
 
 static int __init oom_init(void)
 {
+	int err;
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", vm_oom_kill_table);
 #endif
+
+#ifdef CONFIG_BPF_SYSCALL
+	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+	if (err)
+		pr_warn("error while registering oom fmodret entrypoints: %d", err);
+#endif
+
 	return 0;
 }
 subsys_initcall(oom_init)