From patchwork Thu Jul 27 07:36:28 2023
X-Patchwork-Submitter: Chuyi Zhou
X-Patchwork-Id: 13329083
X-Patchwork-Delegate: bpf@iogearbox.net
From: Chuyi Zhou
To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 wuyun.abel@bytedance.com, robin.lu@bytedance.com, Chuyi Zhou
Subject: [RFC PATCH 1/5] bpf: Introduce BPF_PROG_TYPE_OOM_POLICY
Date: Thu, 27 Jul 2023 15:36:28 +0800
Message-Id: <20230727073632.44983-2-zhouchuyi@bytedance.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20230727073632.44983-1-zhouchuyi@bytedance.com>
References: <20230727073632.44983-1-zhouchuyi@bytedance.com>
X-Patchwork-State: RFC

This patch introduces a new program type, BPF_PROG_TYPE_OOM_POLICY. Programs
of this type are used to select a leaf memcg as the victim from the memcg
tree when a global OOM is invoked. The program takes the ids of two sibling
cgroups as parameters and returns a comparison result indicating which of
the two should be chosen as the victim.

Suggested-by: Abel Wu
Signed-off-by: Chuyi Zhou
---
 include/linux/bpf_oom.h   |  22 +++
 include/linux/bpf_types.h |   2 +
 include/uapi/linux/bpf.h  |  14 ++++
 kernel/bpf/syscall.c      |  10 +++
 mm/oom_kill.c             | 168 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 216 insertions(+)
 create mode 100644 include/linux/bpf_oom.h

diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..f4235a83d3bb
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BPF_OOM_H
+#define _BPF_OOM_H
+
+#include
+#include
+#include
+
+struct bpf_oom_policy {
+	struct bpf_prog_array __rcu *progs;
+};
+
+int oom_policy_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int oom_policy_prog_detach(const union bpf_attr *attr);
+int oom_policy_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
+
+int __bpf_run_oom_policy(u64 cg_id_1, u64 cg_id_2);
+
+bool bpf_oom_policy_enabled(void);
+
+#endif
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fc0d6f32c687..8ab6009b7dd9 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -83,6 +83,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
 BPF_PROG_TYPE(BPF_PROG_TYPE_NETFILTER, netfilter,
 	      struct bpf_nf_ctx, struct bpf_nf_ctx)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_OOM_POLICY, oom_policy,
+	      struct bpf_oom_ctx, struct bpf_oom_ctx)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60a9d59beeab..9da0d61cf703 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -987,6 +987,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	BPF_PROG_TYPE_OOM_POLICY,
 };
 
 enum bpf_attach_type {
@@ -1036,6 +1037,7 @@ enum bpf_attach_type {
 	BPF_LSM_CGROUP,
 	BPF_STRUCT_OPS,
 	BPF_NETFILTER,
+	BPF_OOM_POLICY,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -6825,6 +6827,18 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+enum {
+	BPF_OOM_CMP_EQUAL	= (1ULL << 0),
+	BPF_OOM_CMP_GREATER	= (1ULL << 1),
+	BPF_OOM_CMP_LESS	= (1ULL << 2),
+};
+
+struct bpf_oom_ctx {
+	__u64 cg_id_1;
+	__u64 cg_id_2;
+	__u8 cmp_ret;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a2aef900519c..fb6fb6294eba 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -3588,6 +3589,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_XDP;
 	case BPF_LSM_CGROUP:
 		return BPF_PROG_TYPE_LSM;
+	case BPF_OOM_POLICY:
+		return BPF_PROG_TYPE_OOM_POLICY;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -3634,6
+3637,9 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_PROG_TYPE_FLOW_DISSECTOR: ret = netns_bpf_prog_attach(attr, prog); break; + case BPF_PROG_TYPE_OOM_POLICY: + ret = oom_policy_prog_attach(attr, prog); + break; case BPF_PROG_TYPE_CGROUP_DEVICE: case BPF_PROG_TYPE_CGROUP_SKB: case BPF_PROG_TYPE_CGROUP_SOCK: @@ -3676,6 +3682,8 @@ static int bpf_prog_detach(const union bpf_attr *attr) return lirc_prog_detach(attr); case BPF_PROG_TYPE_FLOW_DISSECTOR: return netns_bpf_prog_detach(attr, ptype); + case BPF_PROG_TYPE_OOM_POLICY: + return oom_policy_prog_detach(attr); case BPF_PROG_TYPE_CGROUP_DEVICE: case BPF_PROG_TYPE_CGROUP_SKB: case BPF_PROG_TYPE_CGROUP_SOCK: @@ -3733,6 +3741,8 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_FLOW_DISSECTOR: case BPF_SK_LOOKUP: return netns_bpf_prog_query(attr, uattr); + case BPF_OOM_POLICY: + return oom_policy_prog_query(attr, uattr); case BPF_SK_SKB_STREAM_PARSER: case BPF_SK_SKB_STREAM_VERDICT: case BPF_SK_MSG_VERDICT: diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 612b5597d3af..01af8adaa16c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -19,6 +19,7 @@ */ #include +#include #include #include #include @@ -73,6 +74,9 @@ static inline bool is_memcg_oom(struct oom_control *oc) return oc->memcg != NULL; } +DEFINE_MUTEX(oom_policy_lock); +static struct bpf_oom_policy global_oom_policy; + #ifdef CONFIG_NUMA /** * oom_cpuset_eligible() - check task eligibility for kill @@ -1258,3 +1262,167 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) return -ENOSYS; #endif /* CONFIG_MMU */ } + +const struct bpf_prog_ops oom_policy_prog_ops = { +}; + +static const struct bpf_func_proto * +oom_policy_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + return bpf_base_func_proto(func_id); +} + +static bool oom_policy_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + if (off < 0 || off + size > sizeof(struct bpf_oom_ctx) || off % size) + return false; + + switch (off) { + case bpf_ctx_range(struct bpf_oom_ctx, cg_id_1): + case bpf_ctx_range(struct bpf_oom_ctx, cg_id_2): + if (type != BPF_READ) + return false; + bpf_ctx_record_field_size(info, sizeof(__u64)); + return bpf_ctx_narrow_access_ok(off, size, sizeof(__u64)); + case bpf_ctx_range(struct bpf_oom_ctx, cmp_ret): + if (type == BPF_READ) { + bpf_ctx_record_field_size(info, sizeof(__u8)); + return bpf_ctx_narrow_access_ok(off, size, sizeof(__u8)); + } else { + return size == sizeof(__u8); + } + default: + return false; + } +} + +const struct bpf_verifier_ops oom_policy_verifier_ops = { + .get_func_proto = oom_policy_func_proto, + .is_valid_access = oom_policy_is_valid_access, +}; + +#define BPF_MAX_PROGS 10 + +int oom_policy_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) +{ + struct bpf_prog_array *old_array; + struct bpf_prog_array *new_array; + int ret; + + mutex_lock(&oom_policy_lock); + old_array = rcu_dereference(global_oom_policy.progs); + if (old_array && bpf_prog_array_length(old_array) >= BPF_MAX_PROGS) { + ret = -E2BIG; + goto unlock; + } + ret = bpf_prog_array_copy(old_array, NULL, prog, 0, &new_array); + if (ret < 0) + goto unlock; + + rcu_assign_pointer(global_oom_policy.progs, new_array); + bpf_prog_array_free(old_array); + +unlock: + mutex_unlock(&oom_policy_lock); + return ret; +} + +static int detach_prog(struct bpf_prog *prog) +{ + struct bpf_prog_array *old_array; + struct bpf_prog_array *new_array; + int ret; + + 
mutex_lock(&oom_policy_lock); + old_array = rcu_dereference(global_oom_policy.progs); + ret = bpf_prog_array_copy(old_array, prog, NULL, 0, &new_array); + + if (ret) + goto unlock; + + rcu_assign_pointer(global_oom_policy.progs, new_array); + bpf_prog_array_free(old_array); + bpf_prog_put(prog); +unlock: + mutex_unlock(&oom_policy_lock); + return ret; +} + +int oom_policy_prog_detach(const union bpf_attr *attr) +{ + struct bpf_prog *prog; + int ret; + + if (attr->attach_flags) + return -EINVAL; + + prog = bpf_prog_get_type(attr->attach_bpf_fd, + BPF_PROG_TYPE_OOM_POLICY); + if (IS_ERR(prog)) + return PTR_ERR(prog); + + ret = detach_prog(prog); + bpf_prog_put(prog); + + return ret; +} + +int oom_policy_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) +{ + __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids); + struct bpf_prog_array *progs; + u32 cnt, flags; + int ret = 0; + + if (attr->query.query_flags) + return -EINVAL; + + mutex_lock(&oom_policy_lock); + progs = rcu_dereference(global_oom_policy.progs); + cnt = progs ? bpf_prog_array_length(progs) : 0; + if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt))) { + ret = -EFAULT; + goto unlock; + } + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) { + ret = -EFAULT; + goto unlock; + } + if (attr->query.prog_cnt != 0 && prog_ids && cnt) + ret = bpf_prog_array_copy_to_user(progs, prog_ids, + attr->query.prog_cnt); + +unlock: + mutex_unlock(&oom_policy_lock); + return ret; +} + +int __bpf_run_oom_policy(u64 cg_id_1, u64 cg_id_2) +{ + struct bpf_oom_ctx ctx = { + .cg_id_1 = cg_id_1, + .cg_id_2 = cg_id_2, + .cmp_ret = BPF_OOM_CMP_EQUAL, + }; + rcu_read_lock(); + bpf_prog_run_array(rcu_dereference(global_oom_policy.progs), + &ctx, bpf_prog_run); + rcu_read_unlock(); + return ctx.cmp_ret; +} + +bool bpf_oom_policy_enabled(void) +{ + struct bpf_prog_array *prog_array; + bool empty = true; + + rcu_read_lock(); + prog_array = rcu_dereference(global_oom_policy.progs); + if (prog_array) + empty = bpf_prog_array_is_empty(prog_array); + rcu_read_unlock(); + return !empty; +} From patchwork Thu Jul 27 07:36:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chuyi Zhou X-Patchwork-Id: 13329084 Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C219CBE72 for ; Thu, 27 Jul 2023 07:37:47 +0000 (UTC) Received: from mail-oo1-xc30.google.com (mail-oo1-xc30.google.com [IPv6:2607:f8b0:4864:20::c30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 822C265B7 for ; Thu, 27 Jul 2023 00:37:19 -0700 (PDT) Received: by mail-oo1-xc30.google.com with SMTP id 006d021491bc7-5661eb57452so470242eaf.2 for ; Thu, 27 Jul 2023 00:37:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1690443437; x=1691048237; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=s3P4jk/DjTpvx9PdAJj920ScGLlcJW6HGkHJHTkHl6A=; b=S7FhqPdC6B45lRXe055P5PaVB94tznGk9FrHIpoUyTL61ml7EZmcmoma4G1m8PTLNX qNy+xG/t8mZ8BQRoO2wrIto0n1SbVsyupw1IUQDbWg0ss/fGxYr2YQrHj2LrK0eZM3mB sijyrWeRuFgWJUOYkTfB86DrJRKy+rO0LluT1ezkA07OZB9v+0ToMXPEJMqFGmU0CxQG QgVQPZXhXjWdhJP+Fmn1qGbZuOKCLTw+7cj5vCjF2x7/ezfUNsHM6sP0vFj38UxZ0anZ 
From: Chuyi Zhou
To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 wuyun.abel@bytedance.com, robin.lu@bytedance.com, Chuyi Zhou
Subject: [RFC PATCH 2/5] mm: Select victim memcg using bpf prog
Date: Thu, 27 Jul 2023 15:36:29 +0800
Message-Id: <20230727073632.44983-3-zhouchuyi@bytedance.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20230727073632.44983-1-zhouchuyi@bytedance.com>
References: <20230727073632.44983-1-zhouchuyi@bytedance.com>
X-Patchwork-State: RFC

This patch uses a BPF program to bypass the default select_bad_process()
method and select a victim memcg when a global OOM is invoked. Specifically,
we iterate over root_mem_cgroup's children and use __bpf_run_oom_policy() to
pick the root of the next iteration, repeating until we end up at a leaf
memcg. Then we use oom_evaluate_task() to find a victim task in the selected
memcg. If there is no suitable process to kill in that memcg, we fall back to
the default method.
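The policy callback that this walk consults is an ordinary BPF program
operating on struct bpf_oom_ctx from patch 1: it reads cg_id_1/cg_id_2 and
writes one of the BPF_OOM_CMP_* values into cmp_ret. A minimal illustrative
sketch, assuming the UAPI introduced in this series (the complete, map-driven
sample is in patch 5; the cgroup-id comparison below is only a placeholder
policy, and choose_victim is a made-up name):

	/* Illustrative only: prefer the sibling with the larger cgroup id. */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("oom_policy")
	int choose_victim(struct bpf_oom_ctx *ctx)
	{
		if (ctx->cg_id_1 > ctx->cg_id_2)
			ctx->cmp_ret = BPF_OOM_CMP_GREATER;
		else if (ctx->cg_id_1 < ctx->cg_id_2)
			ctx->cmp_ret = BPF_OOM_CMP_LESS;
		else
			ctx->cmp_ret = BPF_OOM_CMP_EQUAL;
		return 0;
	}

	char _license[] SEC("license") = "GPL";

Returning BPF_OOM_CMP_GREATER tells the walk that the first cgroup is the
more preferable victim, so the selection keeps descending into it.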
Suggested-by: Abel Wu Signed-off-by: Chuyi Zhou --- include/linux/memcontrol.h | 6 +++++ mm/memcontrol.c | 50 ++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 17 +++++++++++++ 3 files changed, 73 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5818af8eca5a..7fedc2521c8b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1155,6 +1155,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, unsigned long *total_scanned); +struct mem_cgroup *select_victim_memcg(void); #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -1588,6 +1589,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, { return 0; } + +static inline struct mem_cgroup *select_victim_memcg(void) +{ + return NULL; +} #endif /* CONFIG_MEMCG */ static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e8ca4bdcb03c..c6b42635f1af 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -64,6 +64,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -2638,6 +2639,55 @@ void mem_cgroup_handle_over_high(void) css_put(&memcg->css); } +struct mem_cgroup *select_victim_memcg(void) +{ + struct cgroup_subsys_state *pos, *parent, *victim; + struct mem_cgroup *victim_memcg; + + parent = &root_mem_cgroup->css; + victim_memcg = NULL; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return NULL; + + rcu_read_lock(); + while (parent) { + struct cgroup_subsys_state *chosen = NULL; + struct mem_cgroup *pos_mem, *chosen_mem; + u64 chosen_id, pos_id; + int cmp_ret; + + victim = parent; + + list_for_each_entry_rcu(pos, &parent->children, sibling) { + pos_id = cgroup_id(pos->cgroup); + if (!chosen) + goto chose; + + cmp_ret = __bpf_run_oom_policy(chosen_id, pos_id); + if (cmp_ret == BPF_OOM_CMP_GREATER) + continue; + if (cmp_ret == BPF_OOM_CMP_EQUAL) { + pos_mem = mem_cgroup_from_css(pos); + chosen_mem = mem_cgroup_from_css(chosen); + if (page_counter_read(&pos_mem->memory) <= + page_counter_read(&chosen_mem->memory)) + continue; + } +chose: + chosen = pos; + chosen_id = pos_id; + } + parent = chosen; + } + + if (victim && css_tryget(victim)) + victim_memcg = mem_cgroup_from_css(victim); + rcu_read_unlock(); + + return victim_memcg; +} + static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 01af8adaa16c..b88c8c7d4ee4 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -361,6 +361,19 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) return 1; } +static bool bpf_select_bad_process(struct oom_control *oc) +{ + struct mem_cgroup *victim_memcg; + + victim_memcg = select_victim_memcg(); + if (victim_memcg) { + mem_cgroup_scan_tasks(victim_memcg, oom_evaluate_task, oc); + css_put(&victim_memcg->css); + } + + return !!oc->chosen; +} + /* * Simple selection loop. We choose the process with the highest number of * 'points'. In case scan was aborted, oc->chosen is set to -1. 
@@ -372,6 +385,9 @@ static void select_bad_process(struct oom_control *oc) if (is_memcg_oom(oc)) mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc); else { + if (bpf_oom_policy_enabled() && bpf_select_bad_process(oc)) + return; + struct task_struct *p; rcu_read_lock(); @@ -1426,3 +1442,4 @@ bool bpf_oom_policy_enabled(void) rcu_read_unlock(); return !empty; } + From patchwork Thu Jul 27 07:36:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chuyi Zhou X-Patchwork-Id: 13329085 X-Patchwork-Delegate: bpf@iogearbox.net Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1D9FBBE72 for ; Thu, 27 Jul 2023 07:37:50 +0000 (UTC) Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C52926A42 for ; Thu, 27 Jul 2023 00:37:23 -0700 (PDT) Received: by mail-pf1-x435.google.com with SMTP id d2e1a72fcca58-686f94328a4so119404b3a.0 for ; Thu, 27 Jul 2023 00:37:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1690443442; x=1691048242; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=r//0VfNIsSS7LjEKz2w3sxhtDAhJR7x5LJkBswgmGuQ=; b=bvNQcm1txH3rhOO7bm1O9db1gBgcBMCQs09BOjh+n6R3VBja2J59RqrtgNWO9awRAF EmhJsX7mQDb9SHJfhSrkICHn+WQAe+gtR3inPpRAzBY6qQm1agnybe8DuKumdO5jlFKe 3YHkH7nP76dnwLFRRldyNlJof4ZD7ZNlpVEshWiVntSoFuZJgffmlqMqIhPq7i/8LGKD cnP86b8NCN8urnuP2ojgUdnkOLytJMwGVrSs3Rp5ZqjCx+OJDr2iX6bR9VytoWWz39b+ J/IBNZIBta/QlN/Xir19mHZ05UZLuiWy9NoVByRL9UYOatX6ERlOMwp7Y8Ubo90+OnSE +4Sw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690443442; x=1691048242; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=r//0VfNIsSS7LjEKz2w3sxhtDAhJR7x5LJkBswgmGuQ=; b=QQpAf5RcsZKOyFopD9cArgz5Gv68OcY3j/oj1dcBLDtfpIDF1+t4oF+LwKxMsDq2Y4 gO7vagb9v6Pl23xS9Ski+SPLoz2xE/9UUDhO1sd/jATpdHPukmT3o1Is9Th3lZRl0Jgf 7MH2MrMko+GKkYWu+sf6LKDXovkettvugUXRy39Wo2vLnV3xIK9D6YNwZqRn525fxq0o uJGqYo6Y+WCOTyVzuUMX6+26f7yuoZVOOmP/OKBU6ntOuaGp2rpGe8We83Jc1iTo/igx ylOwB3ShPfvwDN+HKfnVloLCfqUbzf/nH0m94XFr9k98QP1LykPHTfv1Uy9BLK98egyb Wn4A== X-Gm-Message-State: ABy/qLagr5ksXyqR+H9GxymssEbhdb7+C6Xs30RKMQ/QNoSMHJ2tmIu1 ecKitxQe9MmZ2KskM/GGc+eccA== X-Google-Smtp-Source: APBJJlG4+R4jPSCVEGlR6rgLCmNVhaHd81UbthPsy8RJlP61GxssPJqO3tI7kRrjB9mH7WDD/QMfIA== X-Received: by 2002:a05:6a21:329d:b0:13a:cfdf:d7a1 with SMTP id yt29-20020a056a21329d00b0013acfdfd7a1mr2311681pzb.2.1690443442119; Thu, 27 Jul 2023 00:37:22 -0700 (PDT) Received: from n37-019-243.byted.org ([180.184.51.134]) by smtp.gmail.com with ESMTPSA id s196-20020a6377cd000000b005638a70110bsm733919pgc.65.2023.07.27.00.37.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Jul 2023 00:37:21 -0700 (PDT) From: Chuyi Zhou To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org, wuyun.abel@bytedance.com, robin.lu@bytedance.com, Chuyi Zhou Subject: [RFC PATCH 3/5] libbpf, bpftool: Support BPF_PROG_TYPE_OOM_POLICY Date: 
Thu, 27 Jul 2023 15:36:30 +0800 Message-Id: <20230727073632.44983-4-zhouchuyi@bytedance.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20230727073632.44983-1-zhouchuyi@bytedance.com> References: <20230727073632.44983-1-zhouchuyi@bytedance.com> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Support BPF_PROG_TYPE_OOM_POLICY program in libbpf and bpftool, so that we can identify and use BPF_PROG_TYPE_OOM_POLICY in our application. Signed-off-by: Chuyi Zhou --- tools/bpf/bpftool/common.c | 1 + tools/include/uapi/linux/bpf.h | 14 ++++++++++++++ tools/lib/bpf/libbpf.c | 3 +++ tools/lib/bpf/libbpf_probes.c | 2 ++ 4 files changed, 20 insertions(+) diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c index cc6e6aae2447..c5c311299c4a 100644 --- a/tools/bpf/bpftool/common.c +++ b/tools/bpf/bpftool/common.c @@ -1089,6 +1089,7 @@ const char *bpf_attach_type_input_str(enum bpf_attach_type t) case BPF_TRACE_FENTRY: return "fentry"; case BPF_TRACE_FEXIT: return "fexit"; case BPF_MODIFY_RETURN: return "mod_ret"; + case BPF_OOM_POLICY: return "oom_policy"; case BPF_SK_REUSEPORT_SELECT: return "sk_skb_reuseport_select"; case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE: return "sk_skb_reuseport_select_or_migrate"; default: return libbpf_bpf_attach_type_str(t); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 60a9d59beeab..9da0d61cf703 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -987,6 +987,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ BPF_PROG_TYPE_NETFILTER, + BPF_PROG_TYPE_OOM_POLICY, }; enum bpf_attach_type { @@ -1036,6 +1037,7 @@ enum bpf_attach_type { BPF_LSM_CGROUP, BPF_STRUCT_OPS, BPF_NETFILTER, + BPF_OOM_POLICY, __MAX_BPF_ATTACH_TYPE }; @@ -6825,6 +6827,18 @@ struct bpf_cgroup_dev_ctx { __u32 minor; }; +enum { + BPF_OOM_CMP_EQUAL = (1ULL << 0), + BPF_OOM_CMP_GREATER = (1ULL << 1), + BPF_OOM_CMP_LESS = (1ULL << 2), +}; + +struct bpf_oom_ctx { + __u64 cg_id_1; + __u64 cg_id_2; + __u8 cmp_ret; +}; + struct bpf_raw_tracepoint_args { __u64 args[0]; }; diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 214f828ece6b..10496bb9b3bc 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -118,6 +118,7 @@ static const char * const attach_type_name[] = { [BPF_TRACE_KPROBE_MULTI] = "trace_kprobe_multi", [BPF_STRUCT_OPS] = "struct_ops", [BPF_NETFILTER] = "netfilter", + [BPF_OOM_POLICY] = "oom_policy", }; static const char * const link_type_name[] = { @@ -204,6 +205,7 @@ static const char * const prog_type_name[] = { [BPF_PROG_TYPE_SK_LOOKUP] = "sk_lookup", [BPF_PROG_TYPE_SYSCALL] = "syscall", [BPF_PROG_TYPE_NETFILTER] = "netfilter", + [BPF_PROG_TYPE_OOM_POLICY] = "oom_policy", }; static int __base_pr(enum libbpf_print_level level, const char *format, @@ -8738,6 +8740,7 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("struct_ops.s+", STRUCT_OPS, 0, SEC_SLEEPABLE), SEC_DEF("sk_lookup", SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE), SEC_DEF("netfilter", NETFILTER, BPF_NETFILTER, SEC_NONE), + 
SEC_DEF("oom_policy", OOM_POLICY, BPF_OOM_POLICY, SEC_ATTACHABLE_OPT), }; static size_t custom_sec_def_cnt; diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index 9c4db90b92b6..dbac3e98a2d7 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -129,6 +129,8 @@ static int probe_prog_load(enum bpf_prog_type prog_type, case BPF_PROG_TYPE_LIRC_MODE2: opts.expected_attach_type = BPF_LIRC_MODE2; break; + case BPF_PROG_TYPE_OOM_POLICY: + opts.expected_attach_type = BPF_OOM_POLICY; case BPF_PROG_TYPE_TRACING: case BPF_PROG_TYPE_LSM: opts.log_buf = buf; From patchwork Thu Jul 27 07:36:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chuyi Zhou X-Patchwork-Id: 13329086 X-Patchwork-Delegate: bpf@iogearbox.net Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D45CBBE72 for ; Thu, 27 Jul 2023 07:37:54 +0000 (UTC) Received: from mail-oi1-x22d.google.com (mail-oi1-x22d.google.com [IPv6:2607:f8b0:4864:20::22d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 76F564200 for ; Thu, 27 Jul 2023 00:37:28 -0700 (PDT) Received: by mail-oi1-x22d.google.com with SMTP id 5614622812f47-3a38953c928so601644b6e.1 for ; Thu, 27 Jul 2023 00:37:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1690443447; x=1691048247; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Uv3OYCgolDRnDe8sjWn08GTjavdgRys1P8gpfRFXymU=; b=GPxKU1wLA/ico6qT1nEEu7gHQfgASi8rquE1AHhcXYJT1YIgQWCl0gYkO+q7Q4Ajd8 ZGhT1ge1OlP7dnRkKC9NzY2sUvhgM/h6rPXqBu3xK1S9r98sHCrdmoTvcTfANYWZBoLd xfVIwZRmn2yWUEfGaYPS4U465p+g49PadC9fcYd0hx5ojquQUhWVeEKTqP+pZ9xvinLc bNKbsys0ZIGHmQCTT0vWBHPadAdI12u2RXt8ygcYMQkFpLj8uIdMWG4xFoQS93HgAPsT PwYp3/fy/qrx1r2C2MV0kCEoJMDJandD7xo85GeTJRyvBJxZslJGlHE1vTxBwPd27657 T+pw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690443447; x=1691048247; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Uv3OYCgolDRnDe8sjWn08GTjavdgRys1P8gpfRFXymU=; b=ll7KkfTqLHDd5JQJbjHTChLGIVYVpQR14nA2r3SEolDnRNYQ+r0YKGorTJLZJl4ZJn GvHBA9+zTnMymG+1R+H2aXVB2WX/l9asCCIunupJNYjjV19CPnqM+rQ4tIp6OcR7spO3 eFPCgCbrHcDc4mmRDMkKuCpum/SlChR8eBWRrQQCtGMy9Bq5byCkXJZ2wqUpS/qHWDxH vLi31DGEXUXDpjeu5FAOaKYR6S36K1BF6c5/u6PfivCieVDuKn6hpjiSktPKQmOKRm94 5NRlTmABgnTO/Nn9ueyUp7m0uQUf3pcsFfRej5EjySefkFk+MEC12JS276yCdU4Ds9iy Al9A== X-Gm-Message-State: ABy/qLa3OY39a+G97tOfRVsrn3aqEV31zPwsObfGiZDnqkY5trIWMntS /IDMrXr8p576xlWSbEXXEHxQ7w== X-Google-Smtp-Source: APBJJlGHqT5qPjTtSbXoYNJy1i+7DtwD7lYrnGBfTg7InyflfRjIZJ5jUGAx6rk7oi3Ui6choddTkw== X-Received: by 2002:a05:6808:1393:b0:3a5:ca93:fb69 with SMTP id c19-20020a056808139300b003a5ca93fb69mr2618037oiw.55.1690443447035; Thu, 27 Jul 2023 00:37:27 -0700 (PDT) Received: from n37-019-243.byted.org ([180.184.51.134]) by smtp.gmail.com with ESMTPSA id s196-20020a6377cd000000b005638a70110bsm733919pgc.65.2023.07.27.00.37.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Jul 2023 00:37:26 -0700 (PDT) From: Chuyi Zhou To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, ast@kernel.org, 
daniel@iogearbox.net, andrii@kernel.org Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org, wuyun.abel@bytedance.com, robin.lu@bytedance.com, Chuyi Zhou Subject: [RFC PATCH 4/5] bpf: Add a new bpf helper to get cgroup ino Date: Thu, 27 Jul 2023 15:36:31 +0800 Message-Id: <20230727073632.44983-5-zhouchuyi@bytedance.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20230727073632.44983-1-zhouchuyi@bytedance.com> References: <20230727073632.44983-1-zhouchuyi@bytedance.com> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This patch adds a new bpf helper bpf_get_ino_from_cgroup_id, so that we can get the inode number once we know the cgroup id. Cgroup_id is used to identify a cgroup in BPF prog. However we can't get the cgroup id directly in userspace applications. In userspace, we are used to identifying cgroups by their paths or their inodes. However, cgroup id is not always equal to the inode number, depending on the sizeof ino_t. For example, given some cgroup paths, we only care about the events related to those cgroups. We can only do this by updating these paths in a map and doing string comparison in BPF prog, which is not very convenient. However with this new helper, we just need to record the inode in a map and lookup a inode number in BPF prog. Signed-off-by: Chuyi Zhou --- include/uapi/linux/bpf.h | 7 +++++++ kernel/bpf/core.c | 1 + kernel/bpf/helpers.c | 17 +++++++++++++++++ tools/include/uapi/linux/bpf.h | 7 +++++++ 4 files changed, 32 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9da0d61cf703..01efb289fa14 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5575,6 +5575,12 @@ union bpf_attr { * 0 on success. * * **-ENOENT** if the bpf_local_storage cannot be found. + * + * u64 bpf_get_ino_from_cgroup_id(u64 id) + * Description + * Get inode number from a *cgroup id*. + * Return + * Inode number. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ FN(unspec, 0, ##ctx) \ @@ -5789,6 +5795,7 @@ union bpf_attr { FN(user_ringbuf_drain, 209, ##ctx) \ FN(cgrp_storage_get, 210, ##ctx) \ FN(cgrp_storage_delete, 211, ##ctx) \ + FN(get_ino_from_cgroup_id, 212, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index dc85240a0134..49dfdb2dd336 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2666,6 +2666,7 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak; const struct bpf_func_proto bpf_seq_printf_btf_proto __weak; const struct bpf_func_proto bpf_set_retval_proto __weak; const struct bpf_func_proto bpf_get_retval_proto __weak; +const struct bpf_func_proto bpf_get_ino_from_cgroup_id_proto __weak; const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) { diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 9e80efa59a5d..e87328b008d3 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -433,6 +433,21 @@ const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto = { .ret_type = RET_INTEGER, .arg1_type = ARG_ANYTHING, }; + +BPF_CALL_1(bpf_get_ino_from_cgroup_id, u64, id) +{ + u64 ino = kernfs_id_ino(id); + + return ino; +} + +const struct bpf_func_proto bpf_get_ino_from_cgroup_id_proto = { + .func = bpf_get_ino_from_cgroup_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, +}; + #endif /* CONFIG_CGROUPS */ #define BPF_STRTOX_BASE_MASK 0x1F @@ -1767,6 +1782,8 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_get_current_cgroup_id_proto; case BPF_FUNC_get_current_ancestor_cgroup_id: return &bpf_get_current_ancestor_cgroup_id_proto; + case BPF_FUNC_get_ino_from_cgroup_id: + return &bpf_get_ino_from_cgroup_id_proto; #endif default: break; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9da0d61cf703..661d97aacb85 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5575,6 +5575,12 @@ union bpf_attr { * 0 on success. * * **-ENOENT** if the bpf_local_storage cannot be found. + * + * u64 bpf_get_ino_from_cgroup_id(u64 id) + * Description + * Get inode number from a *cgroup id*. + * Return + * Inode number. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) 
	\
 FN(unspec, 0, ##ctx)				\
@@ -5789,6 +5795,7 @@ union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(get_ino_from_cgroup_id, 212, ##ctx)		\
 /* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't

From patchwork Thu Jul 27 07:36:32 2023
X-Patchwork-Submitter: Chuyi Zhou
X-Patchwork-Id: 13329087
X-Patchwork-Delegate: bpf@iogearbox.net
From: Chuyi Zhou
To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
 ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 wuyun.abel@bytedance.com, robin.lu@bytedance.com, Chuyi Zhou
Subject: [RFC PATCH 5/5] bpf: Sample BPF program to set oom policy
Date: Thu, 27 Jul 2023 15:36:32 +0800
Message-Id: <20230727073632.44983-6-zhouchuyi@bytedance.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20230727073632.44983-1-zhouchuyi@bytedance.com>
References: <20230727073632.44983-1-zhouchuyi@bytedance.com>
X-Patchwork-State: RFC

This patch adds a sample showing how to set an OOM victim selection policy
that protects certain cgroups. The BPF program, oom_kern.c, compares the
scores of two sibling memcgs and selects the one with the larger score. The
userspace program, oom_user.c, maintains a score map keyed by cgroup inode
number. Users can give a cgroup a lower score than its siblings to keep it
from being selected.

Suggested-by: Abel Wu
Signed-off-by: Chuyi Zhou
---
 samples/bpf/Makefile   |   3 +
 samples/bpf/oom_kern.c |  42 ++++++++++++++
 samples/bpf/oom_user.c | 128 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/oom_kern.c
 create mode 100644 samples/bpf/oom_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 615f24ebc49c..09dbdec22dad 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -56,6 +56,7 @@ tprogs-y += xdp_redirect_map_multi
 tprogs-y += xdp_redirect_map
 tprogs-y += xdp_redirect
 tprogs-y += xdp_monitor
+tprogs-y += oom
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -118,6 +119,7 @@ xdp_redirect_map-objs := xdp_redirect_map_user.o $(XDP_SAMPLE)
 xdp_redirect-objs := xdp_redirect_user.o $(XDP_SAMPLE)
 xdp_monitor-objs := xdp_monitor_user.o $(XDP_SAMPLE)
 xdp_router_ipv4-objs := xdp_router_ipv4_user.o $(XDP_SAMPLE)
+oom-objs := oom_user.o
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
@@ -173,6 +175,7 @@ always-y += xdp_sample_pkts_kern.o
 always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
+always-y += oom_kern.o
 
 ifeq ($(ARCH), arm)
 # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux
diff --git a/samples/bpf/oom_kern.c b/samples/bpf/oom_kern.c
new file mode 100644
index 000000000000..1e0e2de1e06e
--- /dev/null
+++ b/samples/bpf/oom_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 1024);
+	__type(key, u64);
+	__type(value, u32);
+} sc_map SEC(".maps");
+
+SEC("oom_policy")
+int bpf_prog1(struct bpf_oom_ctx *ctx)
+{
+	u64 cg_ino_1, cg_ino_2;
+	u32 sc_1, sc_2;
+	u32 *value;
+
+	sc_1 = sc_2 = 250;
+	cg_ino_1 = bpf_get_ino_from_cgroup_id(ctx->cg_id_1);
+	cg_ino_2 = bpf_get_ino_from_cgroup_id(ctx->cg_id_2);
+
+	value = bpf_map_lookup_elem(&sc_map, &cg_ino_1);
+	if (value)
+		sc_1 = *value;
+
+	value = bpf_map_lookup_elem(&sc_map, &cg_ino_2);
+	if (value)
+		sc_2 = *value;
+
+	if (sc_1 > sc_2)
+		ctx->cmp_ret = BPF_OOM_CMP_GREATER;
+	else if (sc_1 < sc_2)
+		ctx->cmp_ret = BPF_OOM_CMP_LESS;
+	else
+		ctx->cmp_ret = BPF_OOM_CMP_EQUAL;
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/oom_user.c b/samples/bpf/oom_user.c
new file mode 100644
index 000000000000..7bd2d56ba910
--- /dev/null
+++ b/samples/bpf/oom_user.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "trace_helpers.h"
+
+static int map_fd, prog_fd;
+
+static unsigned long long get_cgroup_inode(const char *path)
+{
+	unsigned long long inode;
+	struct stat file_stat;
+	int fd, ret;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+
+	ret = fstat(fd, &file_stat);
+	close(fd);
+	if (ret < 0)
+		return 0;
+
+	inode = file_stat.st_ino;
+	return inode;
+}
+
+static int set_cgroup_oom_score(const char *cg_path, int score)
+{
+	unsigned long long ino = get_cgroup_inode(cg_path);
+
+	if (!ino) {
+		fprintf(stderr, "ERROR: get inode for %s failed\n", cg_path);
+		return 1;
+	}
+	if (bpf_map_update_elem(map_fd, &ino, &score, BPF_ANY)) {
+		fprintf(stderr, "ERROR: update map failed\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+/**
+ * A simple sample that prefers /root/blue/instance_1 as the victim memcg
+ * and protects /root/blue/instance_2
+ *                 root
+ *                /    \
+ *            user ... blue
+ *            / \      /   \
+ *          ..  instance_1 instance_2
+ */
+
+int main(int argc, char **argv)
+{
+	struct bpf_object *obj = NULL;
+	struct bpf_program *prog;
+	int target_fd = 0;
+	unsigned int prog_cnt;
+
+	obj = bpf_object__open_file("oom_kern.o", NULL);
+	if (libbpf_get_error(obj)) {
+		fprintf(stderr, "ERROR: opening BPF object file failed\n");
+		obj = NULL;
+		goto cleanup;
+	}
+
+	prog = bpf_object__next_program(obj, NULL);
+	bpf_program__set_type(prog, BPF_PROG_TYPE_OOM_POLICY);
+	/* load BPF program */
+	if (bpf_object__load(obj)) {
+		fprintf(stderr, "ERROR: loading BPF object file failed\n");
+		goto cleanup;
+	}
+
+	map_fd = bpf_object__find_map_fd_by_name(obj, "sc_map");
+
+	if (map_fd < 0) {
+		fprintf(stderr, "ERROR: finding a map in obj file failed\n");
+		goto cleanup;
+	}
+
+	/*
+	 * In this sample, the default score is 250 (see oom_kern.c).
+	 * Set a high score for /blue and /blue/instance_1, so that
+	 * when a global OOM happens, /blue/instance_1 is chosen as
+	 * the victim memcg.
+	 */
+	if (set_cgroup_oom_score("/sys/fs/cgroup/blue/", 500)) {
+		fprintf(stderr, "ERROR: set score for /blue failed\n");
+		goto cleanup;
+	}
+	if (set_cgroup_oom_score("/sys/fs/cgroup/blue/instance_1", 500)) {
+		fprintf(stderr, "ERROR: set score for /blue/instance_1 failed\n");
+		goto cleanup;
+	}
+
+	/* set a low score to protect /blue/instance_2 */
+	if (set_cgroup_oom_score("/sys/fs/cgroup/blue/instance_2", 100)) {
+		fprintf(stderr, "ERROR: set score for /blue/instance_2 failed\n");
+		goto cleanup;
+	}
+
+	prog_fd = bpf_program__fd(prog);
+
+	/* Attach bpf program */
+	if (bpf_prog_attach(prog_fd, target_fd, BPF_OOM_POLICY, 0)) {
+		fprintf(stderr, "Failed to attach BPF_OOM_POLICY program\n");
+		goto cleanup;
+	}
+	if (bpf_prog_query(target_fd, BPF_OOM_POLICY, 0, NULL, NULL, &prog_cnt)) {
+		fprintf(stderr, "Failed to query attached programs\n");
+		goto cleanup;
+	}
+	printf("prog_cnt: %d\n", prog_cnt);
+
+cleanup:
+	bpf_object__close(obj);
+	return 0;
+}
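One thing the sample above does not show is tearing the policy down again.
Under the UAPI proposed in patch 1, detaching goes through the regular
BPF_PROG_DETACH command with the program fd passed in attach_bpf_fd, so from
libbpf a sketch along the following lines should work (illustrative only,
not part of this series; detach_oom_policy is a made-up helper name):

	#include <stdio.h>
	#include <bpf/bpf.h>

	/* Detach the oom_policy program that oom_user.c attached.
	 * The second argument (attachable fd) is unused for BPF_OOM_POLICY
	 * and left at 0, mirroring the attach call in the sample.
	 */
	static int detach_oom_policy(int prog_fd)
	{
		int err = bpf_prog_detach2(prog_fd, 0, BPF_OOM_POLICY);

		if (err)
			fprintf(stderr, "Failed to detach BPF_OOM_POLICY program: %d\n", err);
		return err;
	}

For the scores configured by the sample (blue and instance_1 at 500,
instance_2 at 100, everything else at the 250 default), the selection walk
from patch 2 descends root -> blue -> instance_1, so tasks in instance_1 are
evaluated first when a global OOM fires.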