Message ID | 20230810081319.65668-2-zhouchuyi@bytedance.com (mailing list archive)
---|---
State | RFC
Series | mm: Select victim using bpf_oom_evaluate_task
Context | Check | Description |
---|---|---|
bpf/vmtest-bpf-next-PR | success | PR summary |
bpf/vmtest-bpf-next-VM_Test-1 | success | Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }} |
bpf/vmtest-bpf-next-VM_Test-2 | success | Logs for ShellCheck |
bpf/vmtest-bpf-next-VM_Test-3 | fail | Logs for build for aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-4 | fail | Logs for build for s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-5 | fail | Logs for build for x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-6 | fail | Logs for build for x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-7 | success | Logs for set-matrix |
bpf/vmtest-bpf-next-VM_Test-8 | success | Logs for veristat |
netdev/tree_selection | success | Not a local patch, async |
On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
>  static int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
>  	struct oom_control *oc = arg;
> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>  		goto next;
>
> +	/*
> +	 * If task is allocating a lot of memory and has been marked to be
> +	 * killed first if it triggers an oom, then select it.
> +	 */
> +	if (oom_task_origin(task)) {
> +		points = LONG_MAX;
> +		goto select;
> +	}
> +
> +	switch (bpf_oom_evaluate_task(task, oc)) {
> +	case BPF_EVAL_ABORT:
> +		goto abort;	/* abort search process */
> +	case BPF_EVAL_NEXT:
> +		goto next;	/* ignore the task */
> +	case BPF_EVAL_SELECT:
> +		goto select;	/* select the task */
> +	default:
> +		break;		/* No BPF policy */
> +	}
> +

I think forcing the bpf prog to look at every task is going to be limiting
long term. It's more flexible to invoke the bpf prog from out_of_memory()
and, if it doesn't choose a task, fall back to select_bad_process(). I
believe that's what Roman was proposing. bpf can choose to iterate memcgs,
or it might have some side knowledge that there are processes that can be
set as oc->chosen right away, so it can skip the iteration.
Hello,

On 2023/8/17 10:07, Alexei Starovoitov wrote:
> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> [...]
>
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

IIUC, we may need some new BPF features if we want to iterate tasks/memcgs
in BPF, such as:

  bpf_for_each_task
  bpf_for_each_memcg
  bpf_for_each_task_in_memcg
  ...

It seems we have some work to do first on the BPF side. Would these
iteration features be useful in other BPF scenarios besides OOM policy?

Thanks.
On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> [...]
>
> IIUC, we may need some new BPF features if we want to iterate
> tasks/memcgs in BPF, such as:
>   bpf_for_each_task
>   bpf_for_each_memcg
>   bpf_for_each_task_in_memcg
>   ...
>
> It seems we have some work to do first on the BPF side.
> Would these iteration features be useful in other BPF scenarios besides
> OOM policy?

Yes, but use open coded iterators, like the example in
https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/:

  bpf_for_each(task_vma, vma, task, 0) { ... }

which will safely iterate the vma-s of the task. Similarly, struct
css_task_iter can be hidden inside a bpf open coded iterator.
Hello,

On 2023/8/17 11:22, Alexei Starovoitov wrote:
> [...]
>
> Yes.
> Use open coded iterators though.
> Like example in
> https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@fb.com/
>
>   bpf_for_each(task_vma, vma, task, 0) { ... }
>
> will safely iterate vma-s of the task.
> Similarly struct css_task_iter can be hidden inside bpf open coded iterator.

OK. I think the following APIs would be useful, and I am willing to start
with these in another bpf-next RFC patchset:

1. bpf_for_each(task). Just like for_each_process(p) in the kernel, to
   iterate all tasks in the system under rcu_read_lock().

2. bpf_for_each(css_task, task, css). Works like
   css_task_iter_{start, next, end} and would be used to iterate
   tasks/threads under a css.

3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). Works like
   css_next_descendant_{pre, post} to iterate all descendants of a css.

If you have better ideas or any advice, please let me know.

Thanks.
On Thu, Aug 17, 2023 at 8:30 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> [...]
>
> OK. I think the following APIs would be useful, and I am willing to start
> with these in another bpf-next RFC patchset:
>
> 1. bpf_for_each(task). Just like for_each_process(p) in the kernel, to
>    iterate all tasks in the system under rcu_read_lock().
>
> 2. bpf_for_each(css_task, task, css). Works like
>    css_task_iter_{start, next, end} and would be used to iterate
>    tasks/threads under a css.
>
> 3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). Works like
>    css_next_descendant_{pre, post} to iterate all descendants of a css.
>
> If you have better ideas or any advice, please let me know.

Sounds great. These three new iterators are unrelated to the oom discussion
and can be developed/landed in parallel. They will be useful in other bpf
programs.
On Wed 16-08-23 19:07:10, Alexei Starovoitov wrote:
> [...]
>
> I think forcing bpf prog to look at every task is going to be limiting
> long term.
> It's more flexible to invoke bpf prog from out_of_memory()
> and if it doesn't choose a task then fallback to select_bad_process().
> I believe that's what Roman was proposing.
> bpf can choose to iterate memcg or it might have some side knowledge
> that there are processes that can be set as oc->chosen right away,
> so it can skip the iteration.

This is certainly possible, but I am worried it will lead to a lot of
duplication. There are common tasks that all/most oom victim selection
implementations should do. First of all, they should make sure the victim
belongs to the oom domain. Arguably, they should also be aware of an
ongoing oom victim tear down, to prevent over-killing, and handle oom
victim reference counting properly. Most people are not even aware of
those things. Do we really want all of that to be re-invented, most likely
incorrectly? The advantage of reusing oom_evaluate_task is that all of it
can be avoided.

Iterating over tasks with a pre-defined oom victim sure sounds like
unnecessary, wasted CPU cycles, but if you want to prevent over-killing it
is still necessary. As the oom killer can be invoked very rapidly (before
the victim has had a chance to die), I believe this is a very useful
feature.
On 2023/8/10 16:13, Chuyi Zhou wrote:
> [...]
> +enum {
> +	NO_BPF_POLICY,
> +	BPF_EVAL_ABORT,
> +	BPF_EVAL_NEXT,
> +	BPF_EVAL_SELECT,
> +};

I saw that tools/testing/selftests/bpf/progs/oom_policy.c also uses
NO_BPF_POLICY etc. I think the

enum {
	NO_BPF_POLICY,
	BPF_EVAL_ABORT,
	BPF_EVAL_NEXT,
	BPF_EVAL_SELECT,
};

definitions can be placed in include/linux/oom.h.

Thanks
Bixuan Cui
Hello, Bixuan.

On 2023/9/13 09:18, Bixuan Cui wrote:
> [...]
>
> I saw that tools/testing/selftests/bpf/progs/oom_policy.c also uses
> NO_BPF_POLICY etc. I think the definitions can be placed in
> include/linux/oom.h.

Thanks for your feedback! Yes, maybe we should move these enums to a more
proper place, so that they are generated into BTF and we can take them
from vmlinux.h.

> Thanks
> Bixuan Cui
On 2023/8/10 16:13, Chuyi Zhou wrote:
> +__weak noinline int bpf_oom_evaluate_task(struct task_struct *task,
> 					   struct oom_control *oc)
> +{
> +	return NO_BPF_POLICY;
> +}
> +
> +BTF_SET8_START(oom_bpf_fmodret_ids)
> +BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
> +BTF_SET8_END(oom_bpf_fmodret_ids)

I have a question here: why use __weak? Are there other modules that can
redefine bpf_oom_evaluate_task? Why not use __bpf_kfunc
(Documentation/bpf/kfuncs.rst)?

Thanks
Bixuan Cui
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..255c9ef1d808 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
  * kernel subsystems and hints as to where to find out what things do.
  */

+#include <linux/bpf.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -305,6 +306,27 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }

+enum {
+	NO_BPF_POLICY,
+	BPF_EVAL_ABORT,
+	BPF_EVAL_NEXT,
+	BPF_EVAL_SELECT,
+};
+
+__weak noinline int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc)
+{
+	return NO_BPF_POLICY;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_oom_evaluate_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+	.owner = THIS_MODULE,
+	.set = &oom_bpf_fmodret_ids,
+};
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
@@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
 		goto next;

+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc)) {
+	case BPF_EVAL_ABORT:
+		goto abort;	/* abort search process */
+	case BPF_EVAL_NEXT:
+		goto next;	/* ignore the task */
+	case BPF_EVAL_SELECT:
+		goto select;	/* select the task */
+	default:
+		break;		/* No BPF policy */
+	}
+
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -329,15 +371,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}

-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, oc->totalpages);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
@@ -732,10 +765,18 @@ static struct ctl_table vm_oom_kill_table[] = {

 static int __init oom_init(void)
 {
+	int err;
+
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", vm_oom_kill_table);
 #endif
+
+#ifdef CONFIG_BPF_SYSCALL
+	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+	if (err)
+		pr_warn("error while registering oom fmodret entrypoints: %d", err);
+#endif
 	return 0;
 }
 subsys_initcall(oom_init)
This patch adds a new hook, bpf_oom_evaluate_task, in oom_evaluate_task. It
takes oc and the currently iterated task as parameters and returns a result
indicating which one should be selected. We can use it to bypass the
current logic of oom_evaluate_task and implement customized OOM policies in
the attached BPF programs.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 mm/oom_kill.c | 59 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 9 deletions(-)