diff mbox series

[RFC] Android OOM helper proof of concept

Message ID e01ee3aa-cbcb-dae2-ebcc-aba8b01d8aef@sony.com (mailing list archive)
State New, archived
Headers show
Series [RFC] Android OOM helper proof of concept | expand

Commit Message

Peter Enderborg April 22, 2021, 12:33 p.m. UTC
On 4/21/21 4:29 PM, Michal Hocko wrote:
> On Wed 21-04-21 06:57:43, Shakeel Butt wrote:
>> On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <mhocko@suse.com> wrote:
>> [...]
>>>> To decide when to kill, the oom-killer has to read a lot of metrics.
>>>> It has to open a lot of files to read them and there will definitely
>>>> be new allocations involved in those operations. For example reading
>>>> memory.stat does a page size allocation. Similarly, to perform action
>>>> the oom-killer may have to read cgroup.procs file which again has
>>>> allocation inside it.
>>> True but many of those can be avoided by opening the file early. At
>>> least seq_file based ones will not allocate later if the output size
>>> doesn't increase. Which should be the case for many. I think it is a
>>> general improvement to push those who allocate during read to an open
>>> time allocation.
>>>
>> I agree that this would be a general improvement but it is not always
>> possible (see below).
> It would be still great to invest into those improvements. And I would
> be really grateful to learn about bottlenecks from the existing kernel
> interfaces you have found on the way.
>
>>>> Regarding sophisticated oom policy, I can give one example of our
>>>> cluster level policy. For robustness, many user facing jobs run a lot
>>>> of instances in a cluster to handle failures. Such jobs are tolerant
>>>> to some amount of failures but they still have requirements to not let
>>>> the number of running instances below some threshold. Normally killing
>>>> such jobs is fine but we do want to make sure that we do not violate
>>>> their cluster level agreement. So, the userspace oom-killer may
>>>> dynamically need to confirm if such a job can be killed.
>>> What kind of data do you need to examine to make those decisions?
>>>
>> Most of the time the cluster level scheduler pushes the information to
>> the node controller which transfers that information to the
>> oom-killer. However based on the freshness of the information the
>> oom-killer might request to pull the latest information (IPC and RPC).
> I cannot imagine any OOM handler to be reliable if it has to depend on
> other userspace component with a lower resource priority. OOM handlers
> are fundamentally complex components which has to reduce their
> dependencies to the bare minimum.


I think we very much need a OOM killer that can help out,
but it is essential that it also play with android rules.

This is RFC patch that interact with OOM

From 09f3a2e401d4ed77e95b7cea7edb7c5c3e6a0c62 Mon Sep 17 00:00:00 2001
From: Peter Enderborg <peter.enderborg@sony.com>
Date: Thu, 22 Apr 2021 14:15:46 +0200
Subject: [PATCH] mm/oom: Android oomhelper

This is proff of concept of a pre-oom-killer that kill task
strictly on oom-score-adj order if the score is positive.

It act as lifeline when userspace does not have optimal performance.
---
 drivers/staging/Makefile              |  1 +
 drivers/staging/oomhelper/Makefile    |  2 +
 drivers/staging/oomhelper/oomhelper.c | 65 +++++++++++++++++++++++++++
 mm/oom_kill.c                         |  4 +-
 4 files changed, 70 insertions(+), 2 deletions(-)
 create mode 100644 drivers/staging/oomhelper/Makefile
 create mode 100644 drivers/staging/oomhelper/oomhelper.c

Comments

Michal Hocko April 22, 2021, 1:03 p.m. UTC | #1
On Thu 22-04-21 14:33:45, peter enderborg wrote:
[...]
> I think we very much need a OOM killer that can help out,
> but it is essential that it also play with android rules.

This is completely off topic to the discussion here.
diff mbox series

Patch

diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 2245059e69c7..4a5449b42568 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -47,3 +47,4 @@  obj-$(CONFIG_QLGE)        += qlge/
 obj-$(CONFIG_WIMAX)        += wimax/
 obj-$(CONFIG_WFX)        += wfx/
 obj-y                += hikey9xx/
+obj-y                += oomhelper/
diff --git a/drivers/staging/oomhelper/Makefile b/drivers/staging/oomhelper/Makefile
new file mode 100644
index 000000000000..ee9b361957f8
--- /dev/null
+++ b/drivers/staging/oomhelper/Makefile
@@ -0,0 +1,2 @@ 
+# SPDX-License-Identifier: GPL-2.0
+obj-y    += oomhelper.o
diff --git a/drivers/staging/oomhelper/oomhelper.c b/drivers/staging/oomhelper/oomhelper.c
new file mode 100644
index 000000000000..5a3fe0270cb8
--- /dev/null
+++ b/drivers/staging/oomhelper/oomhelper.c
@@ -0,0 +1,65 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* prof of concept of android aware oom killer */
+/* Author: peter.enderborg@sony.com */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/oom.h>
+void wake_oom_reaper(struct task_struct *tsk); /* need to public ... */
+void __oom_kill_process(struct task_struct *victim, const char *message);
+
+static int oomhelper_oom_notify(struct notifier_block *self,
+                      unsigned long notused, void *param)
+{
+  struct task_struct *tsk;
+  struct task_struct *selected = NULL;
+  int highest = 0;
+
+  pr_info("invited");
+  rcu_read_lock();
+  for_each_process(tsk) {
+      struct task_struct *candidate;
+      if (tsk->flags & PF_KTHREAD)
+          continue;
+
+      /* Ignore task if coredump in progress */
+      if (tsk->mm && tsk->mm->core_state)
+          continue;
+      candidate = find_lock_task_mm(tsk);
+      if (!candidate)
+          continue;
+
+      if (highest < candidate->signal->oom_score_adj) {
+          /* for test dont kill level 0 */
+          highest = candidate->signal->oom_score_adj;
+          selected = candidate;
+          pr_info("new selected %d %d", selected->pid,
+              selected->signal->oom_score_adj);
+      }
+      task_unlock(candidate);
+  }
+  if (selected) {
+      get_task_struct(selected);
+  }
+  rcu_read_unlock();
+  if (selected) {
+      pr_info("oomhelper killing: %d", selected->pid);
+      __oom_kill_process(selected, "oomhelper");
+  }
+
+  return NOTIFY_OK;
+}
+
+static struct notifier_block oomhelper_oom_nb = {
+    .notifier_call = oomhelper_oom_notify
+};
+
+int __init oomhelper_register_oom_notifier(void)
+{
+    register_oom_notifier(&oomhelper_oom_nb);
+    pr_info("oomhelper installed");
+    return 0;
+}
+
+subsys_initcall(oomhelper_register_oom_notifier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fa1cf18bac97..a5f7299af9a3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -658,7 +658,7 @@  static int oom_reaper(void *unused)
     return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
     /* mm is already queued? */
     if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -856,7 +856,7 @@  static bool task_will_free_mem(struct task_struct *task)
     return ret;
 }
 
-static void __oom_kill_process(struct task_struct *victim, const char *message)
+void __oom_kill_process(struct task_struct *victim, const char *message)
 {
     struct task_struct *p;
     struct mm_struct *mm;