From patchwork Tue Apr 11 06:58:15 2023
X-Patchwork-Submitter: Gang Li
X-Patchwork-Id: 13207124
From: Gang Li
To: Waiman Long, Michal Hocko
Cc: Gang Li, cgroups@vger.kernel.org, linux-mm@kvack.org, rientjes@google.com, Zefan Li, linux-kernel@vger.kernel.org
Subject: [PATCH v4] mm: oom: introduce cpuset oom
Date: Tue, 11 Apr 2023 14:58:15 +0800
Message-Id: <20230411065816.9798-1-ligang.bdlg@bytedance.com>

Cpusets constrain the CPU and Memory placement of tasks.
The `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
has never been utilized.

When a process in a cpuset that constrains memory placement triggers
oom, the oom killer may kill a completely irrelevant process on other
numa nodes, which will not release any memory for this cpuset.

We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
selecting the victim from cpusets with the same mems_allowed as the
current one.

Example:

Create two processes named mem_on_node0 and mem_on_node1, each
constrained by its own cpuset. Each process allocates memory on its
own node. When node0 runs out of memory, OOM will be invoked by
mem_on_node0.

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected
from the entire system. Therefore, the OOM is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This
is a useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [  13432]     0 13432   787016   786745        6344704        0             0 mem_on_node1
[ 2787.189115] [  13457]     0 13457   787002   785504        6340608        0             0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim will be selected only from cpusets that have the same
mems_allowed as the cpuset that invoked oom. This will prevent
useless kills and protect innocent victims.
```
[  395.922444] mem_on_node0 invoked oom-killer
[  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  396.246128] [   2614]     0  2614  1311294  1144192        9224192        0             0 mem_on_node0
[  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko
Cc:
Cc:
Cc:
Cc: Waiman Long
Cc: Zefan Li
Signed-off-by: Gang Li
---
Changes in v4:
- Modify comments and documentation.

Changes in v3:
- https://lore.kernel.org/all/20230410025056.22103-1-ligang.bdlg@bytedance.com/
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add a doctext
  comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst   | 16 ++++++-
 Documentation/admin-guide/cgroup-v2.rst |  4 ++
 include/linux/cpuset.h                  |  6 +++
 kernel/cgroup/cpuset.c                  | 43 +++++++++++++++++++
 mm/oom_kill.c                           |  4 ++
 5 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..51ffdc0eb167 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@ Written by Simon.Derr@bull.net
      1.6 What is memory spread ?
      1.7 What is sched_load_balance ?
      1.8 What is sched_relax_domain_level ?
-     1.9 How do I use cpusets ?
+     1.9 What is cpuset oom ?
+     1.10 How do I use cpusets ?
    2. Usage Examples and Syntax
      2.1 Basic Usage
      2.2 Adding/removing cpus
@@ -607,8 +608,19 @@ If your situation is:
  - The latency is required even it sacrifices cache hit rate etc.
    then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim will be selected
+from all the cpusets that have the same mems_allowed as the cpuset
+which invoked OOM.
+
+Cpuset oom works in both cgroup v1 and v2.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..594aa71cf441 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2199,6 +2199,10 @@ Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	When a process invokes oom due to the constraint of cpuset.mems,
+	the victim will be selected from cpusets with the same
+	mems_allowed as the current one.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..cb6b49245e18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,49 @@ void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in cpuset oom context.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
 
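
Illustration only, not part of the patch: the filtering rule used by
cpuset_scan_tasks() can be sketched in user space roughly as follows.
Everything here is a hypothetical stand-in (toy_task, toy_cpuset,
toy_select_victim, and the 64-bit mask in place of nodemask_t), and
"largest rss wins" is an assumed simplification of what
oom_evaluate_task() really computes. The only point it tries to show is
that tasks in cpusets whose mems_allowed differs from the invoker's are
never considered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
typedef uint64_t toy_nodemask;		/* bit n set => NUMA node n allowed */

struct toy_task {
	int pid;
	long rss_pages;			/* assumed stand-in for the oom badness score */
};

struct toy_cpuset {
	toy_nodemask mems_allowed;
	const struct toy_task *tasks;
	size_t ntasks;
};

/*
 * Return the task with the largest rss among all cpusets whose
 * mems_allowed equals that of sets[invoker], or NULL if those cpusets
 * hold no tasks. Cpusets with a different mask are skipped entirely,
 * mirroring the nodes_equal() check in the patch.
 */
static const struct toy_task *
toy_select_victim(const struct toy_cpuset *sets, size_t nsets, size_t invoker)
{
	const struct toy_task *chosen = NULL;

	assert(invoker < nsets);
	for (size_t i = 0; i < nsets; i++) {
		if (sets[i].mems_allowed != sets[invoker].mems_allowed)
			continue;
		for (size_t j = 0; j < sets[i].ntasks; j++) {
			const struct toy_task *t = &sets[i].tasks[j];

			if (!chosen || t->rss_pages > chosen->rss_pages)
				chosen = t;
		}
	}
	return chosen;
}
```

With one cpuset per node holding the two tasks from the example above,
calling toy_select_victim() with the node0 cpuset as invoker can only
return mem_on_node0, even though mem_on_node1 has the larger rss.
Overlapping but unequal masks are simply skipped, which is the case the
TODO in the patch leaves open.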