From patchwork Tue May 10 00:17:59 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844365
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 0995EC433FE
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:37 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233494AbiEJAWa (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:30 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60848 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233071AbiEJAWU (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:20 -0400
Received: from mail-pg1-x54a.google.com (mail-pg1-x54a.google.com
 [IPv6:2607:f8b0:4864:20::54a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 14B2828B683
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:23 -0700 (PDT)
Received: by mail-pg1-x54a.google.com with SMTP id
 204-20020a6302d5000000b003c273168068so8020628pgc.21
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=cLSVxAMPxomX8iOSFW0ZVKsdAIndGhXT5AJQPUvW5Fg=;
        b=rLYiXhWSIPTO4QlsWWVnMB1yQrJnNU7f2lB2/RPQ2ugJWM5oSMhBEWvb6LCp4B+0WS
         u4qMiBtf6yrbSqAt2KnsWbqoO5CFt/3c4eh4rVhSJP3d0b/X3y2qHkt+9kj7v22wVidY
         JsKo5SANuI7byyKq4WDQcmqrtbvHG5hFIWON2tILrixPqwXs2CLhW/tC8ii4jxhUGndz
         wwMyd5xwLFLV9aKrQEhwQAe74ePObo2MfNbE1JvJxuoEoOmbSrebCAsBFGA3VvKp1OKq
         3DZaj1wUu2Sp5QDfX5rE1gYeS7MxoXeo/dA3W7faKAP0qYN1qJvf0JNZNl4ezqugx9f2
         K5rA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=cLSVxAMPxomX8iOSFW0ZVKsdAIndGhXT5AJQPUvW5Fg=;
        b=z2I8SGFZIcqn5WdwQ+dKXW1migbprz3KPhFb0VtPYjSoNbcoP5gmxagTcRaCtjbq94
         vqlIsEc0Twy/aVgKnnT1l8TeFnEvskTOMssC1hVUmBMSXxseNaQLrcle9q/A20TWszaE
         xAHbkmVMsWi2rQGHCBD8+4DTTqXgJzgEsvXKuzuJcqOtmwRbJRxSRhzfGcMrWkHLQ+IV
         i1Pv7BB0rbtiz1doKxYDTYAx1XSN3ZAdcWr7caGuY7EJZEB1PGmPim88YLFdLwg3Tfg/
         Ei+jEj2jZTU/WIPrFrdQoPdPLgq7oQnwDdnVaTS+WlWZ6Mi8waLogF1Jb6VH+zVnMAR/
         gEVQ==
X-Gm-Message-State: AOAM53239Dyml2oasvD7izmVci21sF8QLdhlF8Hj04uhMnPoqrzojay5
        qxilmxHDEmqHlRfzCE9E9h7kmeosA8pTWNdy
X-Google-Smtp-Source: 
 ABdhPJy5CHl16lJuEoEk8yUWwJ2YnLdPqjPbpSJT9KvZJxYooTvfRnH08buP7FRUwmZ7ja8CnSzZUhPDZe93mNpU
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a05:6a00:b94:b0:50f:2255:ae03 with SMTP
 id g20-20020a056a000b9400b0050f2255ae03mr18349791pfj.74.1652141902482; Mon,
 09 May 2022 17:18:22 -0700 (PDT)
Date: Tue, 10 May 2022 00:17:59 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-2-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program
 type
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

This patch introduces a new bpf program type CGROUP_SUBSYS_RSTAT,
with new corresponding link and attach types.

The main purpose of these programs is to allow BPF programs to collect
and maintain hierarchical cgroup stats easily and efficiently by making
use of the rstat framework in the kernel.

Those programs attach to a cgroup subsystem. They typically contain logic
to aggregate per-cpu and per-cgroup stats collected by other BPF programs.

Currently, only rstat flusher programs can be attached to cgroup
subsystems, but this can be extended later if a use-case arises.

See the selftest in the final patch for a practical example.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
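
Note for reviewers: the following is a minimal userspace sketch, not the
selftest from the final patch, of the aggregation logic such a flusher
program might contain. It assumes a separate collector writes per-cpu,
per-cgroup deltas, and that flushers are invoked children-first for each
cpu. The fixed-size arrays and all model_* names are hypothetical; only
the three context fields come from this patch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS    4	/* hypothetical sizes for this model */
#define NR_CGROUPS 8

/* Mirrors the fields of struct bpf_rstat_ctx added by this patch. */
struct rstat_ctx {
	uint64_t cgroup_id;
	uint64_t parent_cgroup_id; /* 0 if root */
	int32_t cpu;
};

/* Per-cpu, per-cgroup deltas written by a (hypothetical) collector. */
static uint64_t percpu_delta[NR_CGROUPS][NR_CPUS];
/* Deltas pushed up by children and not yet folded into this cgroup. */
static uint64_t pending[NR_CGROUPS];
/* Hierarchical totals: own events plus those of all descendants. */
static uint64_t total[NR_CGROUPS];

/* Collector side: record n events for cgroup id on the given cpu. */
void model_collect(uint64_t id, int cpu, uint64_t n)
{
	assert(id < NR_CGROUPS && cpu >= 0 && cpu < NR_CPUS);
	percpu_delta[id][cpu] += n;
}

/* Flusher side: what a program might do for one (cgroup, cpu) pair. */
void model_flush(const struct rstat_ctx *ctx)
{
	uint64_t id = ctx->cgroup_id;
	uint64_t delta;

	assert(id < NR_CGROUPS && ctx->parent_cgroup_id < NR_CGROUPS);

	/* Fold this cpu's delta and anything the children pushed up. */
	delta = percpu_delta[id][ctx->cpu] + pending[id];
	percpu_delta[id][ctx->cpu] = 0;
	pending[id] = 0;

	total[id] += delta;
	/* Propagate to the parent; it folds this in on its own flush. */
	if (ctx->parent_cgroup_id)
		pending[ctx->parent_cgroup_id] += delta;
}

/* Convenience wrapper that builds the context the kernel would pass. */
void model_flush_one(uint64_t id, uint64_t parent, int cpu)
{
	struct rstat_ctx ctx = {
		.cgroup_id = id,
		.parent_cgroup_id = parent,
		.cpu = cpu,
	};

	model_flush(&ctx);
}

uint64_t model_total(uint64_t id)
{
	assert(id < NR_CGROUPS);
	return total[id];
}
```

In a real program the arrays would presumably be BPF maps keyed by
cgroup id, and the flush order would come from the rstat updated tree
rather than from the caller.
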
 include/linux/bpf-cgroup-subsys.h |  30 ++++++
 include/linux/bpf_types.h         |   2 +
 include/linux/cgroup-defs.h       |   4 +
 include/uapi/linux/bpf.h          |  12 +++
 kernel/bpf/Makefile               |   1 +
 kernel/bpf/cgroup_subsys.c        | 166 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c              |   6 ++
 kernel/cgroup/cgroup.c            |   1 +
 tools/include/uapi/linux/bpf.h    |  12 +++
 9 files changed, 234 insertions(+)
 create mode 100644 include/linux/bpf-cgroup-subsys.h
 create mode 100644 kernel/bpf/cgroup_subsys.c
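
Note for reviewers: a small userspace sketch of the context access rule
implemented by cgroup_subsys_rstat_is_valid_access() in this patch, in
case it helps to see which loads are accepted. The enum and the
model_is_valid_access() name are hypothetical stand-ins; the extra
size <= 0 check exists only because this model has no verifier in front
of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct bpf_rstat_ctx in this patch. */
struct rstat_ctx {
	uint64_t cgroup_id;
	uint64_t parent_cgroup_id; /* 0 if root */
	int32_t cpu;
};

/* The switch below relies on cpu following two 8-byte fields. */
static_assert(offsetof(struct rstat_ctx, cpu) == 16, "unexpected layout");

/* Hypothetical stand-in for enum bpf_access_type. */
enum access_type { ACCESS_READ, ACCESS_WRITE };

bool model_is_valid_access(int off, int size, enum access_type type)
{
	/* The context is read-only. */
	if (type == ACCESS_WRITE)
		return false;

	/* The access must fall inside the context structure. */
	if (off < 0 || size <= 0 ||
	    (size_t)off + (size_t)size > sizeof(struct rstat_ctx))
		return false;

	/* The access must be naturally aligned. */
	if (off % size != 0)
		return false;

	/* Only whole-field loads at a field's start offset are allowed. */
	switch (off) {
	case offsetof(struct rstat_ctx, cgroup_id):
	case offsetof(struct rstat_ctx, parent_cgroup_id):
		return size == sizeof(uint64_t);
	case offsetof(struct rstat_ctx, cpu):
		return size == sizeof(int32_t);
	default:
		return false;
	}
}
```

In this model, partial loads such as the low 4 bytes of cgroup_id are
rejected, as are loads that start in the middle of a field.
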

diff --git a/include/linux/bpf-cgroup-subsys.h b/include/linux/bpf-cgroup-subsys.h
new file mode 100644
index 000000000000..4dcde06b5599
--- /dev/null
+++ b/include/linux/bpf-cgroup-subsys.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2022 Google LLC.
+ */
+#ifndef _BPF_CGROUP_SUBSYS_H_
+#define _BPF_CGROUP_SUBSYS_H_
+
+#include <linux/bpf.h>
+
+struct cgroup_subsys_bpf {
+	/* Head of the list of BPF rstat flushers attached to this subsystem */
+	struct list_head rstat_flushers;
+	spinlock_t flushers_lock;
+};
+
+struct bpf_subsys_rstat_flusher {
+	struct bpf_prog *prog;
+	/* List of BPF rstat flushers, anchored at subsys->bpf */
+	struct list_head list;
+};
+
+struct bpf_cgroup_subsys_link {
+	struct bpf_link link;
+	struct cgroup_subsys *ss;
+};
+
+int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog);
+
+#endif  // _BPF_CGROUP_SUBSYS_H_
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 3e24ad0c4b3c..854ee958b0e4 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -56,6 +56,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl,
 	      struct bpf_sysctl, struct bpf_sysctl_kern)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt,
 	      struct bpf_sockopt, struct bpf_sockopt_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT, cgroup_subsys_rstat,
+	      struct bpf_rstat_ctx, struct bpf_rstat_ctx)
 #endif
 #ifdef CONFIG_BPF_LIRC_MODE2
 BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..3bd6eed1fa13 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/workqueue.h>
 #include <linux/bpf-cgroup-defs.h>
+#include <linux/bpf-cgroup-subsys.h>
 #include <linux/psi_types.h>
 
 #ifdef CONFIG_CGROUPS
@@ -706,6 +707,9 @@ struct cgroup_subsys {
 	 * specifies the mask of subsystems that this one depends on.
 	 */
 	unsigned int depends_on;
+
+	/* used to store bpf programs.*/
+	struct cgroup_subsys_bpf bpf;
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d14b10b85e51..0f4855fa85db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,6 +952,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
 };
 
 enum bpf_attach_type {
@@ -998,6 +999,7 @@ enum bpf_attach_type {
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
+	BPF_CGROUP_SUBSYS_RSTAT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1013,6 +1015,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_XDP = 6,
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
+	BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1482,6 +1485,9 @@ union bpf_attr {
 				 */
 				__u64		bpf_cookie;
 			} perf_event;
+			struct {
+				__u64		name;
+			} cgroup_subsys;
 			struct {
 				__u32		flags;
 				__u32		cnt;
@@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_rstat_ctx {
+	__u64 cgroup_id;
+	__u64 parent_cgroup_id; /* 0 if root */
+	__s32 cpu;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..6caf4a61e543 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,6 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
+obj-$(CONFIG_CGROUP_BPF) += cgroup_subsys.o
 ifeq ($(CONFIG_INET),y)
 obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
 endif
diff --git a/kernel/bpf/cgroup_subsys.c b/kernel/bpf/cgroup_subsys.c
new file mode 100644
index 000000000000..9673ce6aa84a
--- /dev/null
+++ b/kernel/bpf/cgroup_subsys.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+
+#include <linux/bpf-cgroup-subsys.h>
+#include <linux/filter.h>
+
+#include "../cgroup/cgroup-internal.h"
+
+
+static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *prog)
+{
+	struct bpf_subsys_rstat_flusher *rstat_flusher;
+
+	rstat_flusher = kmalloc(sizeof(*rstat_flusher), GFP_KERNEL);
+	if (!rstat_flusher)
+		return -ENOMEM;
+	rstat_flusher->prog = prog;
+
+	spin_lock(&ss->bpf.flushers_lock);
+	list_add(&rstat_flusher->list, &ss->bpf.rstat_flushers);
+	spin_unlock(&ss->bpf.flushers_lock);
+
+	return 0;
+}
+
+static void cgroup_subsys_bpf_detach(struct cgroup_subsys *ss, struct bpf_prog *prog)
+{
+	struct bpf_subsys_rstat_flusher *rstat_flusher = NULL;
+
+	spin_lock(&ss->bpf.flushers_lock);
+	list_for_each_entry(rstat_flusher, &ss->bpf.rstat_flushers, list)
+		if (rstat_flusher->prog == prog)
+			break;
+
+	if (rstat_flusher) {
+		list_del(&rstat_flusher->list);
+		bpf_prog_put(rstat_flusher->prog);
+		kfree(rstat_flusher);
+	}
+	spin_unlock(&ss->bpf.flushers_lock);
+}
+
+static void bpf_cgroup_subsys_link_release(struct bpf_link *link)
+{
+	struct bpf_cgroup_subsys_link *ss_link = container_of(link,
+						       struct bpf_cgroup_subsys_link,
+						       link);
+	if (ss_link->ss) {
+		cgroup_subsys_bpf_detach(ss_link->ss, ss_link->link.prog);
+		ss_link->ss = NULL;
+	}
+}
+
+static int bpf_cgroup_subsys_link_detach(struct bpf_link *link)
+{
+	bpf_cgroup_subsys_link_release(link);
+	return 0;
+}
+
+static void bpf_cgroup_subsys_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_cgroup_subsys_link *ss_link = container_of(link,
+						       struct bpf_cgroup_subsys_link,
+						       link);
+	kfree(ss_link);
+}
+
+static const struct bpf_link_ops bpf_cgroup_subsys_link_lops = {
+	.detach = bpf_cgroup_subsys_link_detach,
+	.release = bpf_cgroup_subsys_link_release,
+	.dealloc = bpf_cgroup_subsys_link_dealloc,
+};
+
+int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct bpf_cgroup_subsys_link *link;
+	struct cgroup_subsys *ss, *attach_ss = NULL;
+	const char __user *ss_name_user;
+	char ss_name[MAX_CGROUP_TYPE_NAMELEN];
+	int ssid, err;
+
+	if (attr->link_create.target_fd || attr->link_create.flags)
+		return -EINVAL;
+
+	ss_name_user = u64_to_user_ptr(attr->link_create.cgroup_subsys.name);
+	if (strncpy_from_user(ss_name, ss_name_user, sizeof(ss_name) - 1) < 0)
+		return -EFAULT;
+
+	for_each_subsys(ss, ssid)
+		if (!strcmp(ss_name, ss->name) ||
+		    !strcmp(ss_name, ss->legacy_name))
+			attach_ss = ss;
+
+	if (!attach_ss)
+		return -EINVAL;
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link)
+		return -ENOMEM;
+
+	bpf_link_init(&link->link, BPF_LINK_TYPE_CGROUP_SUBSYS,
+		      &bpf_cgroup_subsys_link_lops,
+		      prog);
+	link->ss = attach_ss;
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		return err;
+	}
+
+	err = cgroup_subsys_bpf_attach(attach_ss, prog);
+	if (err) {
+		bpf_link_cleanup(&link_primer);
+		return err;
+	}
+
+	return bpf_link_settle(&link_primer);
+}
+
+static const struct bpf_func_proto *
+cgroup_subsys_rstat_func_proto(enum bpf_func_id func_id,
+			       const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static bool cgroup_subsys_rstat_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	if (off < 0 || off + size > sizeof(struct bpf_rstat_ctx))
+		return false;
+	/* The verifier guarantees that size > 0 */
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_rstat_ctx, cgroup_id):
+		return size == sizeof(__u64);
+	case offsetof(struct bpf_rstat_ctx, parent_cgroup_id):
+		return size == sizeof(__u64);
+	case offsetof(struct bpf_rstat_ctx, cpu):
+		return size == sizeof(__s32);
+	default:
+		return false;
+	}
+}
+
+const struct bpf_prog_ops cgroup_subsys_rstat_prog_ops = {
+};
+
+const struct bpf_verifier_ops cgroup_subsys_rstat_verifier_ops = {
+	.get_func_proto         = cgroup_subsys_rstat_func_proto,
+	.is_valid_access        = cgroup_subsys_rstat_is_valid_access,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cdaa1152436a..48149c54d969 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3,6 +3,7 @@
  */
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
+#include <linux/bpf-cgroup-subsys.h>
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
@@ -3194,6 +3195,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_SK_LOOKUP;
 	case BPF_XDP:
 		return BPF_PROG_TYPE_XDP;
+	case BPF_CGROUP_SUBSYS_RSTAT:
+		return BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -4341,6 +4344,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 		else
 			ret = bpf_kprobe_multi_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT:
+		ret = cgroup_subsys_bpf_link_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..7b1448013009 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5745,6 +5745,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 
 	idr_init(&ss->css_idr);
 	INIT_LIST_HEAD(&ss->cfts);
+	INIT_LIST_HEAD(&ss->bpf.rstat_flushers);
 
 	/* Create the root cgroup state for this subsystem */
 	ss->root = &cgrp_dfl_root;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d14b10b85e51..0f4855fa85db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -952,6 +952,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
 };
 
 enum bpf_attach_type {
@@ -998,6 +999,7 @@ enum bpf_attach_type {
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
+	BPF_CGROUP_SUBSYS_RSTAT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1013,6 +1015,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_XDP = 6,
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
+	BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1482,6 +1485,9 @@ union bpf_attr {
 				 */
 				__u64		bpf_cookie;
 			} perf_event;
+			struct {
+				__u64		name;
+			} cgroup_subsys;
 			struct {
 				__u32		flags;
 				__u32		cnt;
@@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_rstat_ctx {
+	__u64 cgroup_id;
+	__u64 parent_cgroup_id; /* 0 if root */
+	__s32 cpu;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };

From patchwork Tue May 10 00:18:00 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844366
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9677CC43217
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:38 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233627AbiEJAWb (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:31 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33100 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233528AbiEJAW3 (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:29 -0400
Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com
 [IPv6:2607:f8b0:4864:20::1049])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9A66128B84B
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:24 -0700 (PDT)
Received: by mail-pj1-x1049.google.com with SMTP id
 z11-20020a17090a468b00b001dc792e8660so381514pjf.1
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=YvkcjuDpXc8HWz9CnYpHQqOHXSlPjrKDHoLv5HHHOqc=;
        b=MJHi55tk0JN9+q5cGwIIOQ+Jj2WwPhMAlGDibUDUwpt7+quof2cyyE79xyhhUwI89Y
         u/5KnrZvb0DUC2kgcpQz7sODufIBs2ts//Wawz6Ms1xJ1I4sqYsyv16Q7y5aAC3h0NKS
         vmiwyBlLIXY6+YfJOCjAA1DFGpm80zmopqy2PFz49B/BD5RQY2S5e6XGp+YRwClg7JUO
         1lKzRYsM7rIFumsFTXllZkAKmnVhEbZ+gbnXS2aTvqKceWxC5Mna88o6Mii+lvWEFOpB
         sqzm24QVieegXsKuhqQUEfwzqr9S6F5QMBxUaMwPU9Wg9eFun1+wFgagTcDmhII9EQsO
         FC2A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=YvkcjuDpXc8HWz9CnYpHQqOHXSlPjrKDHoLv5HHHOqc=;
        b=V+GH9xHLzE/zlHvADwWkvnx+c8gMYHjCjdU3G4SQ2FqlBwfE/L5Eo2cEdT3sVD3cd7
         gkTvXftwkdSt3dagfq+iMP2MU4WEK+KwJf/SxVhrz8MLzIkXSORtIZVlDigLDi3+NWEM
         sr54tN7NExANmylR2OMLVrCm7fLNv6w7+NqPPyWCPemR1zaTiLkaq22YumdThtucujMS
         ZZQt0ntDJ2wNFdrSPqHupflgRsu12kCFePs3ac/GwAViu1TWSnVokSaWT6zrLsTyyXzH
         q+yXH79eKJgDa/7pW6B702YxvNPYz6BGQKZf8T47ERA2/LsVEDZzi4D7hQZgVqmgqw9i
         I2Tw==
X-Gm-Message-State: AOAM533dpELhxDdWl/OpxQZd9XGvqsR4bgCUKgSR/BHUMvcBwldvU0f4
        zYXeqzmCYnFGKrFuGwZ+XgNuK7juTPTRHl9Z
X-Google-Smtp-Source: 
 ABdhPJyichTNAYcXj0dlN7pnuR9xWFr0whwiPnqKBK4VX98J1jmH6Taa+gu5jRuIr97cvSnAyVcnQV8xNuZ9Rbo1
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a05:6a00:b41:b0:50d:35fa:476d with SMTP
 id p1-20020a056a000b4100b0050d35fa476dmr18246948pfo.33.1652141904069; Mon, 09
 May 2022 17:18:24 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:00 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-3-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

When a cgroup is popped from the rstat updated tree, subsystems' rstat
flushers are run through the css_rstat_flush() callback. Also run bpf
flushers for all subsystems that have at least one bpf rstat flusher
attached, and are enabled for this cgroup.

A list of subsystems that have attached rstat flushers is maintained to
avoid looping through all subsystems for all cpus for every cgroup that
is being popped from the updated tree. Since we introduce a lock here to
protect this list, also use it to protect rstat_flushers lists inside
each subsystem (since they both need to be locked together anyway), and
get rid of the locks in struct cgroup_subsys_bpf.

rstat flushers are run for any enabled subsystem that has flushers
attached, even if it does not subscribe to css flushing through
css_rstat_flush(). This gives flexibility for bpf programs to collect
stats for any subsystem, regardless of the implementation changes in the
kernel.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
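
Note for reviewers: a userspace sketch of the bookkeeping described
above, in case it helps review. A subsystem joins the global list when
its first flusher is attached and leaves it when the last one is
detached, so the flush path only walks subsystems that have flushers.
Locking is omitted, a counter stands in for the per-subsystem
rstat_flushers list, and the enabled_mask argument is a stand-in for the
cgroup_css() check; all model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBSYS 4	/* hypothetical number of subsystems */

/* Minimal intrusive doubly-linked list, in the spirit of list_head. */
struct node {
	struct node *prev, *next;
};

struct subsys_model {
	int nr_flushers;		/* stands in for rstat_flushers */
	struct node active_node;	/* stands in for rstat_subsys_node */
};

static struct subsys_model subsys[NR_SUBSYS];
/* Subsystems that currently have at least one flusher attached. */
static struct node active_list = { &active_list, &active_list };

static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

void model_attach(int ssid)
{
	struct subsys_model *ss;

	assert(ssid >= 0 && ssid < NR_SUBSYS);
	ss = &subsys[ssid];
	/* The first flusher puts the subsystem on the active list. */
	if (ss->nr_flushers == 0)
		node_add(&ss->active_node, &active_list);
	ss->nr_flushers++;
}

void model_detach(int ssid)
{
	struct subsys_model *ss;

	assert(ssid >= 0 && ssid < NR_SUBSYS);
	ss = &subsys[ssid];
	if (ss->nr_flushers == 0)
		return;	/* nothing attached; the model treats this as a no-op */
	/* The last flusher takes the subsystem off the active list. */
	if (--ss->nr_flushers == 0)
		node_del(&ss->active_node);
}

/*
 * Returns how many flushers would run for a cgroup whose enabled
 * subsystems are given as a bitmask (bit i set: subsystem i enabled).
 */
int model_run_flushers(unsigned int enabled_mask)
{
	struct node *n;
	int ran = 0;

	for (n = active_list.next; n != &active_list; n = n->next) {
		struct subsys_model *ss = (struct subsys_model *)
			((char *)n - offsetof(struct subsys_model, active_node));
		unsigned int bit = 1u << (unsigned int)(ss - subsys);

		/* Subsystem is not enabled for this cgroup. */
		if (!(enabled_mask & bit))
			continue;
		ran += ss->nr_flushers;
	}
	return ran;
}
```

Unlike cgroup_subsys_bpf_detach() in this patch, the model only touches
the active list when a flusher was actually removed.
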
 include/linux/bpf-cgroup-subsys.h |  7 +++-
 include/linux/cgroup.h            |  2 ++
 kernel/bpf/cgroup_subsys.c        | 60 +++++++++++++++++++++++++++----
 kernel/cgroup/cgroup.c            |  5 +--
 kernel/cgroup/rstat.c             | 11 ++++++
 5 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/include/linux/bpf-cgroup-subsys.h b/include/linux/bpf-cgroup-subsys.h
index 4dcde06b5599..e977b9ef5754 100644
--- a/include/linux/bpf-cgroup-subsys.h
+++ b/include/linux/bpf-cgroup-subsys.h
@@ -10,7 +10,11 @@
 struct cgroup_subsys_bpf {
 	/* Head of the list of BPF rstat flushers attached to this subsystem */
 	struct list_head rstat_flushers;
-	spinlock_t flushers_lock;
+	/*
+	 * A list that runs through subsystems that have at least one rstat
+	 * flusher.
+	 */
+	struct list_head rstat_subsys_node;
 };
 
 struct bpf_subsys_rstat_flusher {
@@ -26,5 +30,6 @@ struct bpf_cgroup_subsys_link {
 
 int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
 				  struct bpf_prog *prog);
+void bpf_run_rstat_flushers(struct cgroup *cgrp, int cpu);
 
 #endif  // _BPF_CGROUP_SUBSYS_H_
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0d1ada8968d7..5408c74d5c44 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -97,6 +97,8 @@ extern struct css_set init_css_set;
 
 bool css_has_online_children(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss);
+struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
+				       struct cgroup_subsys *ss);
 struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgroup,
 					 struct cgroup_subsys *ss);
 struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
diff --git a/kernel/bpf/cgroup_subsys.c b/kernel/bpf/cgroup_subsys.c
index 9673ce6aa84a..1d10319a34e9 100644
--- a/kernel/bpf/cgroup_subsys.c
+++ b/kernel/bpf/cgroup_subsys.c
@@ -6,10 +6,46 @@
  */
 
 #include <linux/bpf-cgroup-subsys.h>
+#include <linux/cgroup.h>
 #include <linux/filter.h>
 
 #include "../cgroup/cgroup-internal.h"
 
+/* List of subsystems that have rstat flushers attached */
+static LIST_HEAD(bpf_rstat_subsys_list);
+/* Protects the above list, and the lists of rstat flushers in each subsys */
+static DEFINE_SPINLOCK(bpf_rstat_subsys_lock);
+
+
+void bpf_run_rstat_flushers(struct cgroup *cgrp, int cpu)
+{
+	struct cgroup_subsys_bpf *ss_bpf;
+	struct cgroup *parent = cgroup_parent(cgrp);
+	struct bpf_rstat_ctx ctx = {
+		.cgroup_id = cgroup_id(cgrp),
+		.parent_cgroup_id = parent ? cgroup_id(parent) : 0,
+		.cpu = cpu,
+	};
+
+	rcu_read_lock();
+	migrate_disable();
+	spin_lock(&bpf_rstat_subsys_lock);
+	list_for_each_entry(ss_bpf, &bpf_rstat_subsys_list, rstat_subsys_node) {
+		struct bpf_subsys_rstat_flusher *rstat_flusher;
+		struct cgroup_subsys *ss = container_of(ss_bpf,
+							struct cgroup_subsys,
+							bpf);
+
+		/* Subsystem ss is not enabled for cgrp */
+		if (!cgroup_css(cgrp, ss))
+			continue;
+		list_for_each_entry(rstat_flusher, &ss_bpf->rstat_flushers, list)
+			(void) bpf_prog_run(rstat_flusher->prog, &ctx);
+	}
+	spin_unlock(&bpf_rstat_subsys_lock);
+	migrate_enable();
+	rcu_read_unlock();
+}
 
 static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *prog)
 {
@@ -20,28 +56,38 @@ static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *p
 		return -ENOMEM;
 	rstat_flusher->prog = prog;
 
-	spin_lock(&ss->bpf.flushers_lock);
+	spin_lock(&bpf_rstat_subsys_lock);
+	/* Add ss to bpf_rstat_subsys_list when we attach the first flusher */
+	if (list_empty(&ss->bpf.rstat_flushers))
+		list_add(&ss->bpf.rstat_subsys_node, &bpf_rstat_subsys_list);
 	list_add(&rstat_flusher->list, &ss->bpf.rstat_flushers);
-	spin_unlock(&ss->bpf.flushers_lock);
+	spin_unlock(&bpf_rstat_subsys_lock);
 
 	return 0;
 }
 
 static void cgroup_subsys_bpf_detach(struct cgroup_subsys *ss, struct bpf_prog *prog)
 {
-	struct bpf_subsys_rstat_flusher *rstat_flusher = NULL;
+	struct bpf_subsys_rstat_flusher *iter, *rstat_flusher = NULL;
 
-	spin_lock(&ss->bpf.flushers_lock);
-	list_for_each_entry(rstat_flusher, &ss->bpf.rstat_flushers, list)
-		if (rstat_flusher->prog == prog)
+	spin_lock(&bpf_rstat_subsys_lock);
+	list_for_each_entry(iter, &ss->bpf.rstat_flushers, list)
+		if (iter->prog == prog) {
+			rstat_flusher = iter;
 			break;
+		}
 
 	if (rstat_flusher) {
 		list_del(&rstat_flusher->list);
 		bpf_prog_put(rstat_flusher->prog);
 		kfree(rstat_flusher);
 	}
-	spin_unlock(&ss->bpf.flushers_lock);
+	/*
+	 * Remove ss from bpf_rstat_subsys_list when we detach the last flusher
+	 */
+	if (list_empty(&ss->bpf.rstat_flushers))
+		list_del(&ss->bpf.rstat_subsys_node);
+	spin_unlock(&bpf_rstat_subsys_lock);
 }
 
 static void bpf_cgroup_subsys_link_release(struct bpf_link *link)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7b1448013009..af703cfcb9d2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -478,8 +478,8 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
  * keep accessing it outside the said locks.  This function may return
  * %NULL if @cgrp doesn't have @subsys_id enabled.
  */
-static struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
-					      struct cgroup_subsys *ss)
+struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
+				       struct cgroup_subsys *ss)
 {
 	if (CGROUP_HAS_SUBSYS_CONFIG && ss)
 		return rcu_dereference_check(cgrp->subsys[ss->id],
@@ -5746,6 +5746,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	idr_init(&ss->css_idr);
 	INIT_LIST_HEAD(&ss->cfts);
 	INIT_LIST_HEAD(&ss->bpf.rstat_flushers);
+	INIT_LIST_HEAD(&ss->bpf.rstat_subsys_node);
 
 	/* Create the root cgroup state for this subsystem */
 	ss->root = &cgrp_dfl_root;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..af553a0ccc0d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -2,6 +2,7 @@
 #include "cgroup-internal.h"
 
 #include <linux/sched/cputime.h>
+#include <linux/bpf-cgroup-subsys.h>
 
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
@@ -173,6 +174,16 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 			list_for_each_entry_rcu(css, &pos->rstat_css_list,
 						rstat_css_node)
 				css->ss->css_rstat_flush(css, cpu);
+			/*
+			 * We run bpf flushers in a separate loop in
+			 * bpf_run_rstat_flushers() as the above
+			 * loop only goes through subsystems that have rstat
+			 * flushing registered in the kernel.
+			 *
+			 * This gives flexibility for BPF programs to utilize
+			 * rstat to collect stats for any subsystem.
+			 */
+			bpf_run_rstat_flushers(pos, cpu);
 			rcu_read_unlock();
 		}
 		raw_spin_unlock_irqrestore(cpu_lock, flags);

From patchwork Tue May 10 00:18:01 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844367
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 29B73C433EF
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:41 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233513AbiEJAWc (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:32 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33110 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233546AbiEJAW3 (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:29 -0400
Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com
 [IPv6:2607:f8b0:4864:20::449])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 56F4628C9C8
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:26 -0700 (PDT)
Received: by mail-pf1-x449.google.com with SMTP id
 c202-20020a621cd3000000b0050dd228152aso5399750pfc.11
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:26 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=jUygyQcHmQCX+4t0PO9Lj98oGxUhwQRCqcIwRtRRuu0=;
        b=b9L6wIY6W3JHgNc21GkxYFtPGgfmo5QUJKkjlOhi4hW3quxxLCFiGsyjlcSg3mOEte
         xmYcnYmek/bOQXhLz+sI4VQviMEA+cmcQ45gAHFT3NM89PNw3VF2taoJOECrNh688Ieh
         XVEE6g3VTkOhPesLqypGO5fy0bGZ8/jL+ZYAYtCCGQJukNlJAI77UUs0ewH/ySwBGf6S
         euh6YXkx+e0I2eJNJCkjlpGpu/IgzPTDz31AoWxHLCsaa08vNcDkWlzYVZ3WX+++HFY+
         JMd+L9/ZBEFMpxJLXdRZehKNGnT/yoi38iBjacCWDEVlkD8M9h/fuKz/vEF7G+xhBP93
         CS6w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=jUygyQcHmQCX+4t0PO9Lj98oGxUhwQRCqcIwRtRRuu0=;
        b=EAbSUhfgN+z5LsqnzEOQQetrogoZPcqIqLOw7wpHD31hHizIgc+gZR8RiXj7JuOjZ8
         zDLIZFQ1p0PzrDfqO7vVDsb7Koo6nWPDUtrIxnmnbTrNn9Vee0TJl1yfoDobFWt4Cl5T
         gXIhhWbCglUMGtthJk82+TYqeUpCFoAHCAgRBVKD9wkyuJvJzUC7apRyJ0LTwVTjLFHb
         9IqOWMd0TvzYoFcBS2GaVSpEa6IwIj5x6FVEW2YPYwcEpu7x5kGDjuVaf+Bd2wGdUP93
         U925DDJZLaLXy96sBSytVaMacjBEdvCCl70BwZsSZXsu1ycBgFQ8rM05MxL/FzRUGCs0
         syUA==
X-Gm-Message-State: AOAM532MbUzgs1MHCm3crlM5gECzhHzRwOIrzuDU2bFAiU9cde5nZ9CY
        a//suepea97JEhy4F8xxObIG8PmomgPDIRhl
X-Google-Smtp-Source: 
 ABdhPJws2wA6fi+zGQOzH1vWd8zNdFXSGIXiWcInZijSRi9ZraL3D7PhFIBq8d5am7hWKt6LV39Xxn5hy8O21P2z
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a05:6a00:846:b0:50d:f02f:bb46 with SMTP
 id q6-20020a056a00084600b0050df02fbb46mr18110972pfk.74.1652141905673; Mon, 09
 May 2022 17:18:25 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:01 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-4-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 3/9] libbpf: Add support for rstat progs and
 links
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

Add support for attaching "cgroup_subsys/rstat" programs to a cgroup
subsystem by calling bpf_program__attach_subsys(). Currently,
CGROUP_SUBSYS_RSTAT is the only program type that can be attached to a
subsystem.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
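
Note for reviewers: bpf_link_create() carries the subsystem name to the
kernel as a user pointer stored in a 64-bit attr field. The sketch
below is a userspace illustration of that round trip only. The
ptr_to_u64() body is written to match the libbpf helper of the same
name as I understand it; struct demo_link_create_attr and the demo_*
functions are hypothetical names that are not part of this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed to have the same shape as libbpf's internal ptr_to_u64(). */
static inline uint64_t ptr_to_u64(const void *ptr)
{
	return (uint64_t)(uintptr_t)ptr;
}

/* Hypothetical stand-in for attr.link_create.cgroup_subsys. */
struct demo_link_create_attr {
	uint64_t subsys_name; /* user pointer carried as a 64-bit integer */
};

/* Pack the name pointer the way the new bpf_link_create() case does. */
void demo_fill_attr(struct demo_link_create_attr *attr, const char *name)
{
	memset(attr, 0, sizeof(*attr));
	attr->subsys_name = ptr_to_u64(name);
}

/*
 * Recover the pointer in the same address space. The kernel would
 * instead copy the string from user memory.
 */
const char *demo_read_name(const struct demo_link_create_attr *attr)
{
	return (const char *)(uintptr_t)attr->subsys_name;
}
```

Filling the attr with "memory" and reading it back yields the same
string; a NULL name (the OPTS_GET() default of 0) leaves the field 0.
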
 tools/lib/bpf/bpf.c      |  3 +++
 tools/lib/bpf/bpf.h      |  3 +++
 tools/lib/bpf/libbpf.c   | 35 +++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   |  3 +++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 45 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index cf27251adb92..abfff17cfa07 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -863,6 +863,9 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, kprobe_multi))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_CGROUP_SUBSYS_RSTAT:
+		attr.link_create.cgroup_subsys.name = ptr_to_u64(OPTS_GET(opts, cgroup_subsys.name, 0));
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index f4b4afb6d4ba..384767a9ffd3 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -413,6 +413,9 @@ struct bpf_link_create_opts {
 		struct {
 			__u64 bpf_cookie;
 		} perf_event;
+		struct {
+			const char *name;
+		} cgroup_subsys;
 		struct {
 			__u32 flags;
 			__u32 cnt;
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 809fe209cdcc..56380953df55 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8715,6 +8715,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/setsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("sk_lookup",		SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
+	SEC_DEF("cgroup_subsys/rstat",	CGROUP_SUBSYS_RSTAT, 0, SEC_NONE),
 };
 
 static size_t custom_sec_def_cnt;
@@ -10957,6 +10958,40 @@ static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_l
 	return libbpf_get_error(*link);
 }
 
+struct bpf_link *bpf_program__attach_subsys(const struct bpf_program *prog,
+					     const char *subsys_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, lopts,
+			    .cgroup_subsys.name = subsys_name);
+	struct bpf_link *link = NULL;
+	char errmsg[STRERR_BUFSIZE];
+	int err, prog_fd, link_fd;
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("prog '%s': can't attach before loaded\n", prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return libbpf_err_ptr(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	link_fd = bpf_link_create(prog_fd, 0, BPF_CGROUP_SUBSYS_RSTAT, &lopts);
+	if (link_fd < 0) {
+		err = -errno;
+		pr_warn("prog '%s': failed to attach: %s\n",
+			prog->name, libbpf_strerror_r(err, errmsg,
+						      sizeof(errmsg)));
+		free(link);
+		return libbpf_err_ptr(err);
+	}
+
+	link->fd = link_fd;
+	return link;
+}
+
 struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
 {
 	struct bpf_link *link = NULL;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 05dde85e19a6..eddbffcd39f7 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -537,6 +537,9 @@ bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex);
 LIBBPF_API struct bpf_link *
 bpf_program__attach_freplace(const struct bpf_program *prog,
 			     int target_fd, const char *attach_func_name);
+LIBBPF_API struct bpf_link *
+bpf_program__attach_subsys(const struct bpf_program *prog,
+			   const char *subsys_name);
 
 struct bpf_map;
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index dd35ee58bfaa..5583a2dbfb7c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -447,4 +447,5 @@ LIBBPF_0.8.0 {
 		libbpf_register_prog_handler;
 		libbpf_unregister_prog_handler;
 		bpf_program__attach_kprobe_multi_opts;
+		bpf_program__attach_subsys;
 } LIBBPF_0.7.0;

From patchwork Tue May 10 00:18:02 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844373
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 2E046C433EF
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:20:04 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233570AbiEJAXI (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:23:08 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33102 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233560AbiEJAW3 (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:29 -0400
Received: from mail-pg1-x549.google.com (mail-pg1-x549.google.com
 [IPv6:2607:f8b0:4864:20::549])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D566228C9DE
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:27 -0700 (PDT)
Received: by mail-pg1-x549.google.com with SMTP id
 d127-20020a633685000000b003ab20e589a8so8042137pga.22
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:27 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=KcVlDhYIqiGAy0xA6Ny1iEWIuZrHDDkBi7OMZAuyAKA=;
        b=CbFDRQM+DJHPuLYLBakMiPuPtWC2dPz39UZfOmcu4qup9XyxZPCJINo69q4jZT4AoC
         dLRunNzKgmtg9/SjwCmLiNW5Q87qDk2kiCE3Bvr3D+WjbNu8Syu3BhuMYSNbGa207ILF
         Qh7apNc/4HgAHnDy45R3N6ppLnbRqI6AWkGDHHxN3ezenAc++v64jfbpcV8RvtKTW5OW
         H5cK0SaNA8KayXojnjkq3df/p9zd5sCw6q/pivecBUNQfPA6gTBNzpdlsha/+JCX+EP2
         b27F5hHeQOk9I6B+LWTtNHptDhwCCBC3wK41+O8Y1dY+/ywCr06lqwmPK5GvIEQsBDUo
         dI5w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=KcVlDhYIqiGAy0xA6Ny1iEWIuZrHDDkBi7OMZAuyAKA=;
        b=0gHTEkzxA0gQUk3BdQbofxhidD5SaMipwJwJYcskWC+zlRVdKN7tUPysWQsj0O9//1
         CLS+i0XlgBOoDCgB41DuEVYHr6KZmcNzq0/qDTXhcdUsIS8oiVi7mHuf5KG50+zlxVWL
         1chQ53pqrLzcc66nJqs9mxtBVdLBYQBzknlGOgyi+dARI7187UMaq3I1EP+PF8lDPoWp
         3TZ5eEQyXnuwNkzZtzHhjNuL3OevKTT+gqciu074LLVocVsCfWPE1MwPXKE45QOQmxxU
         3hAInDhfCOy0vs457QLo8zRK1NeUHE0M6vZMbuVPPMi3AXpaJr+dn7xOUVbxctGWheeI
         jXqQ==
X-Gm-Message-State: AOAM532IUaOM+lW+X/U3uwyJLMQQcaq+YeUKIvMsOZHfBmboOD1s8MlK
        noCw8jTL0YDGWab3npDkAw98oaLi7CAmEp6q
X-Google-Smtp-Source: 
 ABdhPJzYYkjJ3ZLtAxYLy2pZUwKzXH4+bOWrXeRHBvANSFiGZs994Oo4pdOYa15sJhMjmUsYw0Y5YSKbOxkEbCUX
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a17:90b:2311:b0:1d9:277e:edad with SMTP
 id mt17-20020a17090b231100b001d9277eedadmr20549707pjb.190.1652141907370; Mon,
 09 May 2022 17:18:27 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:02 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-5-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 4/9] bpf: add bpf rstat helpers
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

Add bpf_cgroup_rstat_updated() and bpf_cgroup_rstat_flush() helpers so
that bpf programs that collect and output cgroup stats can communicate
with the rstat framework, either to add a cgroup to the rstat updated
tree or to trigger an rstat flush before reading stats.

ARG_ANYTHING is used here for the struct cgroup * parameter. Would it be
better to add a task_cgroup(subsys_id) helper that returns a cgroup
pointer so that we can use a BTF argument instead?

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
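
Note for reviewers: the intended calling pattern is "mark updated on
the write side, flush on the read side". The toy userspace model below
only illustrates that pattern. Everything prefixed demo_ is
hypothetical, and unlike the real rstat code it flushes a single cgroup
rather than walking its subtree.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4

/* Hypothetical userspace stand-in for a cgroup with one stat counter. */
struct demo_cgroup {
	struct demo_cgroup *parent;
	long percpu_delta[DEMO_NR_CPUS]; /* unflushed per-cpu updates */
	bool updated[DEMO_NR_CPUS];      /* "on the updated tree" per cpu */
	long total;                      /* flushed, hierarchical total */
};

/* Write side: record a delta, then mark the cgroup as updated. */
void demo_account(struct demo_cgroup *cgrp, int cpu, long delta)
{
	cgrp->percpu_delta[cpu] += delta;
	cgrp->updated[cpu] = true; /* models bpf_cgroup_rstat_updated() */
}

/* Read side: fold pending deltas into this cgroup and its ancestors. */
void demo_flush(struct demo_cgroup *cgrp)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		long delta;

		if (!cgrp->updated[cpu])
			continue; /* nothing changed on this cpu */

		delta = cgrp->percpu_delta[cpu];
		cgrp->percpu_delta[cpu] = 0;
		cgrp->updated[cpu] = false;

		/* Propagate upwards, as the flush description states. */
		for (struct demo_cgroup *pos = cgrp; pos; pos = pos->parent)
			pos->total += delta;
	}
}
```

In this model a reader calls demo_flush() before looking at total, in
the same way a bpf program would call bpf_cgroup_rstat_flush() before
reading stats. A second flush with no new updates changes nothing.
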
 include/uapi/linux/bpf.h       | 18 ++++++++++++++++++
 kernel/bpf/helpers.c           | 30 ++++++++++++++++++++++++++++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++++
 4 files changed, 68 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0f4855fa85db..fce5535579d6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5149,6 +5149,22 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void bpf_cgroup_rstat_updated(struct cgroup *cgrp)
+ *	Description
+ *		Notify the rstat framework that bpf stats were updated for
+ *		*cgrp* on the current cpu. Directly calls cgroup_rstat_updated
+ *		with the given *cgrp* and the current cpu.
+ *	Return
+ *		0
+ *
+ * void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+ *	Description
+ *		Collect all per-cpu stats in *cgrp*'s subtree into global
+ *		counters and propagate them upwards. Directly calls
+ *		cgroup_rstat_flush_irqsafe with the given *cgrp*.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5345,6 +5361,8 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(cgroup_rstat_updated),	\
+	FN(cgroup_rstat_flush),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 315053ef6a75..d124eed97ad7 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1374,6 +1374,32 @@ void bpf_timer_cancel_and_free(void *val)
 	kfree(t);
 }
 
+BPF_CALL_1(bpf_cgroup_rstat_updated, struct cgroup *, cgrp)
+{
+	cgroup_rstat_updated(cgrp, smp_processor_id());
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_cgroup_rstat_updated_proto = {
+	.func		= bpf_cgroup_rstat_updated,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_cgroup_rstat_flush, struct cgroup *, cgrp)
+{
+	cgroup_rstat_flush_irqsafe(cgrp);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_cgroup_rstat_flush_proto = {
+	.func		= bpf_cgroup_rstat_flush,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_ANYTHING,
+};
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
@@ -1426,6 +1452,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_loop_proto;
 	case BPF_FUNC_strncmp:
 		return &bpf_strncmp_proto;
+	case BPF_FUNC_cgroup_rstat_updated:
+		return &bpf_cgroup_rstat_updated_proto;
+	case BPF_FUNC_cgroup_rstat_flush:
+		return &bpf_cgroup_rstat_flush_proto;
 	default:
 		break;
 	}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 096625242475..9e2b08557a6f 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -633,6 +633,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct cgroup',
     ]
     known_types = {
             '...',
@@ -682,6 +683,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct cgroup',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0f4855fa85db..fce5535579d6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5149,6 +5149,22 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void bpf_cgroup_rstat_updated(struct cgroup *cgrp)
+ *	Description
+ *		Notify the rstat framework that bpf stats were updated for
+ *		*cgrp* on the current cpu. Directly calls cgroup_rstat_updated
+ *		with the given *cgrp* and the current cpu.
+ *	Return
+ *		0
+ *
+ * void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+ *	Description
+ *		Collect all per-cpu stats in *cgrp*'s subtree into global
+ *		counters and propagate them upwards. Directly calls
+ *		cgroup_rstat_flush_irqsafe with the given *cgrp*.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5345,6 +5361,8 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(cgroup_rstat_updated),	\
+	FN(cgroup_rstat_flush),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper

From patchwork Tue May 10 00:18:03 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844369
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6C40CC4332F
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:46 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233482AbiEJAWi (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:38 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33112 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233600AbiEJAWa (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:30 -0400
Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com
 [IPv6:2607:f8b0:4864:20::64a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 437D628C9FF
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:30 -0700 (PDT)
Received: by mail-pl1-x64a.google.com with SMTP id
 ij27-20020a170902ab5b00b0015d41282214so9025874plb.9
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=PLe94BZhE80q/pDQVvt5DmK8lXFEEEfCjMN3zuEnNPw=;
        b=b2rDa5+JQZQE63OVUadnzrDA9sRT5AyfXJyIIB31LrDJUWYPfnBVOfa5ojXs9imp6d
         DwF4Awnj/fUwKBhjAsgQY8m/yjkmSpC8h+esfZ2DRrbYeFr9Z/TmnLxUNToiDS3GuWhL
         I1CZcBuzvF+TZLDnC8IEwjt7q1mTj3TCig5j9p1Ib8ZD2Ruj1aVOM/uCMLOLMW0lhB+V
         JFmWTSNEtcFtJ3hkLXygufK/9TBzOJNp+Fh6TscWo5hSpcL6vJVLjYxwh5He67lJ6S5P
         g9h25soVq8JfMXLQ4xTp/9u3XVMKLP8/QyHCB1uSKJk2RK02ZF3mB9ksJ9Py4vK/Ika7
         3aDA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=PLe94BZhE80q/pDQVvt5DmK8lXFEEEfCjMN3zuEnNPw=;
        b=SWMaSeFfTB0sHNZ7PDTU3ZgAA5CuXRl2JYWw3c+LZ3HORajsHWxqlHZNqL9B0HdToj
         BgqZKQhr5YSOclVwL0p335ljmPvb5gIUy0tv5RO2dYe8q5iJD8Upfca1ds0akF/E30DY
         NySY8XB6xncFQqmMGhSNP70heZ1tLK3QKbGtnQGmFnpgC4sIntDVS9tE+MyrC8li8d0N
         mH+zCVeOqds9v9za4/JqF9hAjwVB0TT7ndGrWANnZsfBEPSEYMtJ9CddgNYsBvjkOpwB
         wSzzVEJdUUAb2NlPnLH4seVHMD/XACCBAle7DBuhutd/u8r82FWhEbsD6aF/zmiXQ/XX
         URAQ==
X-Gm-Message-State: AOAM533pIf6kA1u3Nyc3Y9XfWx5i4HoJ6+KxZEePqGI/XQtPK0Ts7l+X
        5aPzgbkHGvPFaMyDaOt7T5aiAoKO/AEy/rDA
X-Google-Smtp-Source: 
 ABdhPJz/zhDfC/scR3LASeLDbHzHW221VoVK/8Sqq0+Z+mT9XUaQdCel9UEi8YE3fJyduB/0tUYYEcm0Ld+tn4Li
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a17:90a:e510:b0:1d9:ee23:9fa1 with SMTP
 id t16-20020a17090ae51000b001d9ee239fa1mr16845pjy.0.1652141908990; Mon, 09
 May 2022 17:18:28 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:03 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-6-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 5/9] bpf: add bpf_map_lookup_percpu_elem() helper
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

Add a helper for bpf programs to look up a percpu element for a cpu
other than the current one. This is useful for rstat flusher programs,
as they are called to aggregate stats from different cpus regardless of
the cpu they run on.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
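
Note for reviewers: a flusher typically walks every cpu and sums the
per-cpu slots for one key. The userspace model below sketches that use
of a lookup that takes an explicit cpu. All demo_ names are
hypothetical. The cpu range check in demo_lookup_percpu() is an
assumption of this model and is not something the helper in this patch
performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4
#define DEMO_MAX_ENTRIES 8

/* Hypothetical stand-in for a BPF_MAP_TYPE_PERCPU_ARRAY of u64. */
struct demo_percpu_array {
	uint64_t slots[DEMO_MAX_ENTRIES][DEMO_NR_CPUS];
};

/* Models bpf_percpu_array_lookup(): check the index, then pick the cpu. */
uint64_t *demo_lookup_percpu(struct demo_percpu_array *map,
			     uint32_t index, int cpu)
{
	if (index >= DEMO_MAX_ENTRIES)
		return NULL;
	/* Extra check in this model only; not present in the patch. */
	if (cpu < 0 || cpu >= DEMO_NR_CPUS)
		return NULL;
	return &map->slots[index][cpu];
}

/* Flusher-style aggregation: sum one entry across every cpu. */
uint64_t demo_sum_entry(struct demo_percpu_array *map, uint32_t index)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		uint64_t *val = demo_lookup_percpu(map, index, cpu);

		if (val)
			sum += *val;
	}
	return sum;
}
```

Writing 10 into entry 2 on cpu 0 and 32 into entry 2 on cpu 3 makes
demo_sum_entry() return 42 for that entry no matter which cpu the
caller runs on, which is the property the flusher relies on.
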
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       |  9 +++++++++
 kernel/bpf/arraymap.c          | 11 ++++++++---
 kernel/bpf/hashtab.c           | 25 +++++++++++--------------
 kernel/bpf/helpers.c           | 26 ++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |  6 ++++++
 tools/include/uapi/linux/bpf.h |  9 +++++++++
 7 files changed, 71 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bdb5298735ce..f6fa35ffe311 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1665,6 +1665,8 @@ int map_set_for_each_callback_args(struct bpf_verifier_env *env,
 				   struct bpf_func_state *caller,
 				   struct bpf_func_state *callee);
 
+void *bpf_percpu_hash_lookup(struct bpf_map *map, void *key, int cpu);
+void *bpf_percpu_array_lookup(struct bpf_map *map, void *key, int cpu);
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fce5535579d6..015ed402c642 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1553,6 +1553,14 @@ union bpf_attr {
  * 		Map value associated to *key*, or **NULL** if no entry was
  * 		found.
  *
+ * void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, int cpu)
+ *	Description
+ *		Perform a lookup in percpu *map* for an entry associated to
+ *		*key* for the given *cpu*.
+ *	Return
+ *		Map value associated to *key* per *cpu*, or **NULL** if no entry
+ *		was found.
+ *
  * long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
  * 	Description
  * 		Add or update the value of the entry associated to *key* in
@@ -5169,6 +5177,7 @@ union bpf_attr {
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
 	FN(map_lookup_elem),		\
+	FN(map_lookup_percpu_elem),	\
 	FN(map_update_elem),		\
 	FN(map_delete_elem),		\
 	FN(probe_read),			\
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..945dae4c20eb 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -230,8 +230,7 @@ static int array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 	return insn - insn_buf;
 }
 
-/* Called from eBPF program */
-static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
+void *bpf_percpu_array_lookup(struct bpf_map *map, void *key, int cpu)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
@@ -239,7 +238,13 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
 	if (unlikely(index >= array->map.max_entries))
 		return NULL;
 
-	return this_cpu_ptr(array->pptrs[index & array->index_mask]);
+	return per_cpu_ptr(array->pptrs[index & array->index_mask], cpu);
+}
+
+/* Called from eBPF program */
+static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return bpf_percpu_array_lookup(map, key, smp_processor_id());
 }
 
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 65877967f414..c6d4699d65e8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -2150,27 +2150,24 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.iter_seq_info = &iter_seq_info,
 };
 
-/* Called from eBPF program */
-static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
+void *bpf_percpu_hash_lookup(struct bpf_map *map, void *key, int cpu)
 {
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct htab_elem *l = __htab_map_lookup_elem(map, key);
 
-	if (l)
-		return this_cpu_ptr(htab_elem_get_ptr(l, map->key_size));
+	if (l) {
+		if (htab_is_lru(htab))
+			bpf_lru_node_set_ref(&l->lru_node);
+		return per_cpu_ptr(htab_elem_get_ptr(l, map->key_size), cpu);
+	}
 	else
 		return NULL;
 }
 
-static void *htab_lru_percpu_map_lookup_elem(struct bpf_map *map, void *key)
+/* Called from eBPF program */
+static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
 {
-	struct htab_elem *l = __htab_map_lookup_elem(map, key);
-
-	if (l) {
-		bpf_lru_node_set_ref(&l->lru_node);
-		return this_cpu_ptr(htab_elem_get_ptr(l, map->key_size));
-	}
-
-	return NULL;
+	return bpf_percpu_hash_lookup(map, key, smp_processor_id());
 }
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
@@ -2279,7 +2276,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
-	.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
+	.map_lookup_elem = htab_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_percpu_map_lookup_and_delete_elem,
 	.map_update_elem = htab_lru_percpu_map_update_elem,
 	.map_delete_elem = htab_lru_map_delete_elem,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d124eed97ad7..abed4e1737f6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -45,6 +45,30 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
 	.arg2_type	= ARG_PTR_TO_MAP_KEY,
 };
 
+BPF_CALL_3(bpf_map_lookup_percpu_elem, struct bpf_map *, map, void *, key,
+	   int, cpu)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_PERCPU_ARRAY:
+		return (unsigned long) bpf_percpu_array_lookup(map, key, cpu);
+	case BPF_MAP_TYPE_PERCPU_HASH:
+	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+		return (unsigned long) bpf_percpu_hash_lookup(map, key, cpu);
+	default:
+		return (unsigned long) NULL;
+	}
+}
+
+const struct bpf_func_proto bpf_map_lookup_percpu_elem_proto = {
+	.func		= bpf_map_lookup_percpu_elem,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_MAP_KEY,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
 	   void *, value, u64, flags)
 {
@@ -1414,6 +1438,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
 		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_lookup_percpu_elem:
+		return &bpf_map_lookup_percpu_elem_proto;
 	case BPF_FUNC_map_update_elem:
 		return &bpf_map_update_elem_proto;
 	case BPF_FUNC_map_delete_elem:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d175b70067b3..2d7f7c9a970d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5879,6 +5879,12 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
 			goto error;
 		break;
+	case BPF_FUNC_map_lookup_percpu_elem:
+		if (map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_PERCPU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY)
+			goto error;
+		break;
 	default:
 		break;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fce5535579d6..015ed402c642 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1553,6 +1553,14 @@ union bpf_attr {
  * 		Map value associated to *key*, or **NULL** if no entry was
  * 		found.
  *
+ * void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, int cpu)
+ *	Description
+ *		Perform a lookup in percpu *map* for an entry associated to
+ *		*key* for the given *cpu*.
+ *	Return
+ *		Map value associated to *key* per *cpu*, or **NULL** if no entry
+ *		was found.
+ *
  * long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
  * 	Description
  * 		Add or update the value of the entry associated to *key* in
@@ -5169,6 +5177,7 @@ union bpf_attr {
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
 	FN(map_lookup_elem),		\
+	FN(map_lookup_percpu_elem),	\
 	FN(map_update_elem),		\
 	FN(map_delete_elem),		\
 	FN(probe_read),			\

From patchwork Tue May 10 00:18:04 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844368
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 3BD0CC4332F
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:43 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233663AbiEJAWd (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:33 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33044 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233614AbiEJAWa (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:30 -0400
Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com
 [IPv6:2607:f8b0:4864:20::104a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3A14B28E4C0
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:31 -0700 (PDT)
Received: by mail-pj1-x104a.google.com with SMTP id
 z14-20020a17090a66ce00b001dd05e92bb8so377630pjl.4
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=Dj8G52Ku/C8Y7g0RB9bNxhRV9jY61G7LwBKsHFtUSBY=;
        b=T2/IAHh55Gjn5HeBRFdMbJMuKc4A52+bUpWlUzeffogFLw8TTHeFLtYgq22aCDSH0T
         9m4JTdgAoZuuUHr5I3kddwnsxcs8TQAQHDjK0UzNNzDxg/le5nOw/ffd5aukwJ4o5odv
         hqrUXfaotL2uO9PHctmV7H/6ClL+ttTG+dlAwNNowSl/QZnHkF0tmiub103C3Lc+H366
         YaCrxgdzJ1UEE5C+opIeGU7M6jGjOoESS0Kq0WQGfx5CJSHP7FogfGa56LdFw0/8ZBX/
         QYaa6FmxrCLIdWPx0PReW/umNNuLKbGeFGSlAXTWp7OOLAXyLtxRE/8udLP34ZfvVWyw
         p5Rw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=Dj8G52Ku/C8Y7g0RB9bNxhRV9jY61G7LwBKsHFtUSBY=;
        b=PNDwW+9UBvjt14AdcW2MxRu90w/cCsGzVI+GvtRbn38Jtsq5X7yFiiP+qodc028FHa
         2GEQjNpf5pmvdTZPxeKxsMxBoDOWdBPd3rxsKSwy97/S5BW0K81KRofYckB7IJl4g+bN
         52XXc/J9Z6qZTc75Uf7JKr6tiIakeUccQveu13aj+4HmzLsF3UwX6CjKzDs9YwvBzlkr
         8GdETEf9Sa4zyysuq2y9Ea3DOLFPaD3U+X/6WDw02BkLH6CQlJYgUeY2Hj7InvquGuiV
         vOL+liP2/7ogyxJRSSZihgr5IIlEmehFor5nfVw2Ap1yZTZUU3dMYlGI0FzTbfvbu6pN
         4FZg==
X-Gm-Message-State: AOAM530oj7CSF10GDWE8OIlyu5hY+9u9LSNCEOqN66+3vWFvQtnusCLJ
        DAzRZtXwXIizJ6jFKVirpgM+s30Pa7iDFYc4
X-Google-Smtp-Source: 
 ABdhPJwJDYWCoQr30dyGFQU8Ug2nQEwX5PR9g5M5GMtyX4z/IRwxNHeNkD3NdPQj8xC89ISB2tVjV9cDHDleViKi
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a62:fb0f:0:b0:4f2:6d3f:5ffb with SMTP
 id x15-20020a62fb0f000000b004f26d3f5ffbmr17950305pfm.55.1652141910739; Mon,
 09 May 2022 17:18:30 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:04 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-7-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 6/9] cgroup: add v1 support to
 cgroup_get_from_id()
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

The current implementation of cgroup_get_from_id() only searches the
default hierarchy for the given id. Make it compatible with cgroup v1 by
looking through all the roots instead.

cgrp_dfl_root should be the first element in the list so there shouldn't
be a performance impact for cgroup v2 users (in the case of a valid id).

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Nacked-by: Tejun Heo <tj@kernel.org>
---
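A minimal userspace sketch of the lookup order described in the commit
message, assuming the default hierarchy is the first entry in the root
list. struct model_root, model_root_has_id() and model_find_root() are
hypothetical names used only for illustration; model_root_has_id()
stands in for kernfs_find_and_get_node_by_id(), and no locking or
reference counting is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one cgroup hierarchy root and the ids it owns. */
struct model_root {
	const char *name;
	const uint64_t *ids;
	size_t nr_ids;
};

/* Hypothetical stand-in for kernfs_find_and_get_node_by_id(). */
static int model_root_has_id(const struct model_root *root, uint64_t id)
{
	size_t i;

	for (i = 0; i < root->nr_ids; i++)
		if (root->ids[i] == id)
			return 1;
	return 0;
}

/*
 * Walk the roots in list order and stop at the first one that knows the id.
 * *nr_lookups reports how many roots were probed, so the cost of a hit in
 * the first (default) root can be compared with a hit in a later root.
 */
static const struct model_root *model_find_root(const struct model_root *roots,
						size_t nr_roots, uint64_t id,
						size_t *nr_lookups)
{
	size_t i;

	assert(nr_lookups);
	*nr_lookups = 0;
	for (i = 0; i < nr_roots; i++) {
		++*nr_lookups;
		if (model_root_has_id(&roots[i], id))
			return &roots[i];
	}
	return NULL;
}
```

In this model an id owned by the first root costs one lookup, while an
id that no root owns costs one lookup per root, which is why the commit
message limits its performance claim to valid ids.
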
 kernel/cgroup/cgroup.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index af703cfcb9d2..12700cd21973 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5970,10 +5970,16 @@ void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 struct cgroup *cgroup_get_from_id(u64 id)
 {
-	struct kernfs_node *kn;
+	struct kernfs_node *kn = NULL;
 	struct cgroup *cgrp = NULL;
+	struct cgroup_root *root;
+
+	for_each_root(root) {
+		kn = kernfs_find_and_get_node_by_id(root->kf_root, id);
+		if (kn)
+			break;
+	}
 
-	kn = kernfs_find_and_get_node_by_id(cgrp_dfl_root.kf_root, id);
 	if (!kn)
 		goto out;
 

From patchwork Tue May 10 00:18:05 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844370
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 97BDBC433F5
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:49 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233728AbiEJAWl (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:41 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33266 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233634AbiEJAWb (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:31 -0400
Received: from mail-pl1-x649.google.com (mail-pl1-x649.google.com
 [IPv6:2607:f8b0:4864:20::649])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38E6128ED15
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:32 -0700 (PDT)
Received: by mail-pl1-x649.google.com with SMTP id
 s2-20020a17090302c200b00158ea215fa2so9029864plk.3
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:32 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=2lpWRE3YG+N9rpa27nVaVWWHXRfllKAdvtsLrNhKAK8=;
        b=E2tVTWuknXlQ+T7MTDseqm2kLasAoY5TbquTKQN6G7HFagvNqmuYGwrGCvXPasSU3N
         X7Hfaz6BkAsSlJCS8floN1LW1Ls3ez8PP3Qk4CDVxNmG8LczACkldG8ojDGR0v0+Q7lZ
         21MumA+UYt52X45SeiI4gFrF5deapKiOdu+hj+4jT0AI8QNFnGPAtcYnE+rE1vLh3I/p
         DgTM+ee4LV0B+gFa0sepONWuiGDP56tlAl4OGQqNbhtSDbkgWBB0l4UraO/yy1Apvxzl
         ZeZuNJv+q2z07LlTR07+RdQp7xf+QnHW2FhcVMIRWFBWLCQI9fLvZVExE5a870tRXnw3
         sazg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=2lpWRE3YG+N9rpa27nVaVWWHXRfllKAdvtsLrNhKAK8=;
        b=i4jre3vb++vggjBUE5hP1zf/e+WqjtHHPGKnt4W/ckM+0a0yrJinryJ2glTw/3l9Ds
         Q9FB07DJ4E1/QLleUPJYikpNkpUWioA+5Pc2R5kuYHiGrTF2iosRiskvObhBsdBipmJy
         5oXdGQLYYG7foqHAXWvdVgk8h1G/7aIGfM773kqsju2KHT7EFks4xE0vX3HOKwKVk1TS
         uhCjZyMir1SZzTQo5lpsX/zYnMnXG5G6//rKWfmGTUvt5i0or0a7ytpIznpj8YRv8viX
         gUjR4RlrnztwO+98C/TiJwyP1V0LoXcNsB5EHlSYPSgeDdNRElHGkXzAWMmcj+ag4bOq
         9ztg==
X-Gm-Message-State: AOAM5335ZGopO/2I/+y4qVhLxflIOfmD+qAtm4HYEzc7lUH7OE25uZfN
        ubFuVamC+OW7euuhjFNJB/v33fB2J4EkEVnE
X-Google-Smtp-Source: 
 ABdhPJy/UitjFbSu8JI5SJOkgoBGVuKW70iU2528pDg3GbFKneC3JYejg17wXmGVU2MICHcJbmM/y4RyYnBm86oa
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a63:8ac7:0:b0:3aa:fa62:5a28 with SMTP
 id y190-20020a638ac7000000b003aafa625a28mr15111188pgd.400.1652141912340; Mon,
 09 May 2022 17:18:32 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:05 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-8-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS
 case
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

From: Hao Luo <haoluo@google.com>

There is already a cgroup_get_from_id() stub in the !CONFIG_CGROUPS
case. Add a matching cgroup_put() stub for !CONFIG_CGROUPS too.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/cgroup.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5408c74d5c44..4f1d8febb9fd 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -759,6 +759,9 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 {
 	return NULL;
 }
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{}
 #endif /* !CONFIG_CGROUPS */
 
 #ifdef CONFIG_CGROUPS

From patchwork Tue May 10 00:18:06 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844371
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 7E8C2C433EF
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:51 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233743AbiEJAWn (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:43 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33298 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233640AbiEJAWb (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:31 -0400
Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com
 [IPv6:2607:f8b0:4864:20::104a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 741FC28F1C6
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:34 -0700 (PDT)
Received: by mail-pj1-x104a.google.com with SMTP id
 o16-20020a17090ab89000b001d84104fc2cso170043pjr.1
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=1DdKbbzvRaz5T1jHgNMHWUuX5VoFw/KcWqMsFaoDHew=;
        b=Qdk/uVBN5FirguzUtk6YBSinwsbw0mWLPIMl0G6mPkw3WWKgrwiYgoK2MxCAb9/0xG
         QkEsszD9SqFMBanTKgXAEsSdCXcNQfoJwo18x+pQSYtGhq6nE+P+f4ulCSi33EYn4MlA
         G2Q8OuGBLOBBi6hp/LY8ttDQ8lHKcgBTudw/sc3UqJXiAKv/2awpINYR3jNu1K9oLHfh
         RJwCDneJtlrFR6mbkc92+QNjjK3m4kLSgCtJq1OiuWLKlhGf99TYFFpyluxL8obtigKi
         EcI5syCaQZqsDAxsU/qL0n/Uue7arYAcP6eocXGX1W4fn+HyN3Tr1WCHNZBzSZcd5uP2
         Y0CA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=1DdKbbzvRaz5T1jHgNMHWUuX5VoFw/KcWqMsFaoDHew=;
        b=PkjBL2MwJEF9HoxRT0ULTmGS01ydvI7p0yfY+i8eCpYcs5KbuBwQDljxpSl7gVM8S8
         XGC2BPrJW7NOGySRzdmHfnF7vA3VPaSe/sdpgASCqJ3uRuS9YiPvFh7y+bZ5Morw2I3H
         0wSmDmlUDdiLvje/B9+/0+eHT9sTqXCpPBSrFcoaw+7CIZZN5vhlDAjqV3VSJCnt+VsJ
         XmUW7qj0m2jhYtSBuA5OynvkUt1R60uN/iBmYGTsjAT0bbNckm0qseWILPs4EM0s7p29
         Ag+K84aAiENdUADR+HDwdrxWVpTG2WwmJUcNPZ/zcfKYgndg+0XrIWmMw9o738amdwJW
         A+mg==
X-Gm-Message-State: AOAM532dYpcPF4JccT91q0/pI8Wb4MWXW33efYUDxsBaXmr2C0d3Sf/u
        u6O2m+P63Ja9TYkzvqyJqzLyo0iHWmPDaz5k
X-Google-Smtp-Source: 
 ABdhPJxW3Y3otQ8oGh68TJSVM4gVMhcZkb3p6z1kA/0Cra4Bi1Vbw2AbwNpT6FalWVUtXvLI5mlPyIc5pkl1bhTj
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a62:5c3:0:b0:50d:4274:4e9d with SMTP id
 186-20020a6205c3000000b0050d42744e9dmr17763400pff.54.1652141913816; Mon, 09
 May 2022 17:18:33 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:06 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-9-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

From: Hao Luo <haoluo@google.com>

Introduce a new type of iter prog: cgroup. Unlike other bpf_iter
programs, this iter doesn't iterate over a set of kernel objects.
Instead, it is parameterized by a cgroup id and prints only that
cgroup, so a target cgroup id must be specified when attaching this
iter. The target cgroup's state can be read out via a link of this iter.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
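A simplified userspace model of the single-element iteration pattern
described above: start() hands out the one target object at position 0
and next() always ends the walk. struct model_seq and the model_*()
helpers are hypothetical names; the real seq_file core does more
(buffering, stop(), error handling), so this is only a sketch of the
control flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a seq_file whose private data is one object. */
struct model_seq {
	void *private;	/* the single target object */
	int nr_shown;	/* how many times show() ran */
};

/* Hand out the target only when the position is still 0. */
static void *model_start(struct model_seq *seq, long long *pos)
{
	if (*pos > 0)
		return NULL;
	++*pos;
	return seq->private;
}

/* There is never a second element, so always terminate the walk. */
static void *model_next(struct model_seq *seq, void *v, long long *pos)
{
	(void)seq;
	(void)v;
	++*pos;
	return NULL;
}

static void model_show(struct model_seq *seq, void *v)
{
	assert(v == seq->private);
	seq->nr_shown++;
}

/* One simplified read pass; returns how many objects were shown. */
static int model_read_once(struct model_seq *seq, long long *pos)
{
	int before = seq->nr_shown;
	void *v = model_start(seq, pos);

	while (v) {
		model_show(seq, v);
		v = model_next(seq, v, pos);
	}
	return seq->nr_shown - before;
}
```

In this model the first pass shows the target once and later passes on
the same position show nothing, which matches the "only one session is
supported" comment in cgroup_iter_seq_start().
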
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/cgroup_iter.c       | 148 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 163 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f6fa35ffe311..f472f43521d2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -43,6 +43,7 @@ struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1601,6 +1602,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct cgroup *cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 015ed402c642..096c521e34de 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5963,6 +5966,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6caf4a61e543..07a715b54190 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..86bdfe135d24
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return *(struct cgroup **)seq->private;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(struct cgroup **)priv_data = aux->cgroup;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.seq_priv_size          = sizeof(struct cgroup *),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	struct cgroup *cgroup;
+
+	cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id);
+	if (!cgroup)
+		return -EBUSY;
+
+	aux->cgroup = cgroup;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	if (aux->cgroup)
+		cgroup_put(aux->cgroup);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	seq_printf(seq, "cgroup_id:\t%llu\n", cgroup_id(aux->cgroup));
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\n");
+		return;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Cgroup_path is the path in the calling process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup, buf, PATH_MAX,
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 015ed402c642..096c521e34de 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5963,6 +5966,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {

From patchwork Tue May 10 00:18:07 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yosry Ahmed <yosryahmed@google.com>
X-Patchwork-Id: 12844372
X-Patchwork-Delegate: bpf@iogearbox.net
Return-Path: <netdev-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id EE5BEC433EF
	for <netdev@archiver.kernel.org>; Tue, 10 May 2022 00:18:53 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233684AbiEJAWp (ORCPT <rfc822;netdev@archiver.kernel.org>);
        Mon, 9 May 2022 20:22:45 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33098 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233661AbiEJAWc (ORCPT
        <rfc822;netdev@vger.kernel.org>); Mon, 9 May 2022 20:22:32 -0400
Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com
 [IPv6:2607:f8b0:4864:20::64a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2566028C9C2
        for <netdev@vger.kernel.org>; Mon,  9 May 2022 17:18:36 -0700 (PDT)
Received: by mail-pl1-x64a.google.com with SMTP id
 i4-20020a17090332c400b0015f099f9582so3264074plr.11
        for <netdev@vger.kernel.org>; Mon, 09 May 2022 17:18:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:in-reply-to:message-id:mime-version:references:subject:from:to
         :cc;
        bh=yleCgo5t9C6YacjTYCtxTmlRlPjnqWARraxO62V68mA=;
        b=Fp4BJuetfuu6Ty0lIR/0DFM7sGyxvJ+bf+ulXF9vG82QDrH+IEkboQxEjoPgcm/vxq
         HvNdU5oIysMV32+GOvDlYQ2xGBcQZC4OKSTfQ0JPIMEUMLvxu1/u/ISWWWIz3+5Ojrxn
         WCVMCbYTD3FiN/6Vc05CDBzDVgcx24F4HkTAV3C3aEMj0bnLLNRujIqeR0V7iPD94C+A
         XJDuojBy85yQJBGBAhFPVgjBVJ+YPCtvblcc532kThu+M6QDgb02OTC/kBDtaIWoZd+v
         ZVvApmbFfKQRM+mwtmL5n7qO/MTmkftMpVjvFQzmB5tj0zQaATeErjclHX1AE2o/2sB5
         x5rQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:in-reply-to:message-id:mime-version
         :references:subject:from:to:cc;
        bh=yleCgo5t9C6YacjTYCtxTmlRlPjnqWARraxO62V68mA=;
        b=vsbSDhwGv3fjBcA39ka0hg6VEyWbDpNsNG3JuUkk5nNB6Rse1EJS/r8lr7h5p4myY1
         P/Px0s7+DhB29rPoMaQjHokjvumojVQNOVxcgYE05/BRv71N3Uu7SUikEXVk695wgpOq
         cYYKzJOEPOiKCCp+pg4zjTYqtovJ90bqTNxUFtGS/JQ8ER9DNr1GmBBruV1xfGgexOVd
         x2GkHVONPJPflK61/XbHf0DHUrwmIAYNBgJO6zTJRldPfUSifCML1rzXy6DlM7dcgxDa
         AN59W9nvJvsZ/VGZasMkhn3hDkF4CxYLbo4ZPF+2W1VW7c6BDv/k57OI4tjzA0nhlr/p
         1THQ==
X-Gm-Message-State: AOAM532ISPf2TUCbqCchbexDQ/6hVCnxM/ltc+bltKwOdpJUbv4tGRE1
        5rLVvP94eTohLz8qN91EUzLE9RQoy8N+gCUH
X-Google-Smtp-Source: 
 ABdhPJx4SZliFgzZyoTIVBSCG5wxcX7td1JaH//1a81pbc8lkx3duw6U7LddmTeTtox8F6g3I1LlRI+yhqY0Cjsl
X-Received: from yosry.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:2327])
 (user=yosryahmed job=sendgmr) by 2002:a17:903:30c3:b0:15b:4232:e5d2 with SMTP
 id s3-20020a17090330c300b0015b4232e5d2mr18030341plc.167.1652141915596; Mon,
 09 May 2022 17:18:35 -0700 (PDT)
Date: Tue, 10 May 2022 00:18:07 +0000
In-Reply-To: <20220510001807.4132027-1-yosryahmed@google.com>
Message-Id: <20220510001807.4132027-10-yosryahmed@google.com>
Mime-Version: 1.0
References: <20220510001807.4132027-1-yosryahmed@google.com>
X-Mailer: git-send-email 2.36.0.512.ge40c2bad7a-goog
Subject: [RFC PATCH bpf-next 9/9] selftest/bpf: add a selftest for cgroup
 hierarchical stats
From: Yosry Ahmed <yosryahmed@google.com>
To: Alexei Starovoitov <ast@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Andrii Nakryiko <andrii@kernel.org>,
        Martin KaFai Lau <kafai@fb.com>,
        Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
        John Fastabend <john.fastabend@gmail.com>,
        KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Shuah Khan <shuah@kernel.org>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        Michal Hocko <mhocko@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>,
        David Rientjes <rientjes@google.com>,
        Greg Thelen <gthelen@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
        bpf@vger.kernel.org, cgroups@vger.kernel.org,
        Yosry Ahmed <yosryahmed@google.com>
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org
X-Patchwork-Delegate: bpf@iogearbox.net
X-Patchwork-State: RFC

Add a selftest that tests the whole workflow for collecting,
aggregating, and displaying cgroup hierarchical stats.

The test loads tracing bpf programs at the beginning and end of
direct reclaim to measure the vmscan latency. Per-cgroup readings are
stored in percpu maps for efficiency. When a cgroup reading is updated,
bpf_cgroup_rstat_updated() is called to add the cgroup (and the current
cpu) to the rstat updated tree. When a cgroup is added to the rstat
updated tree, all its parents are added as well. rstat makes sure
cgroups are popped in a bottom-up fashion.

When an rstat flush is invoked, an rstat flusher program is called for
each (cgroup, cpu) pair on the updated tree. The program aggregates
percpu readings into a total reading, and also propagates them to the
parent. After rstat flushing is over, the program will have been invoked
for all (cgroup, cpu) pairs that have updates as well as their parents,
so the whole hierarchy will have updated (flushed) stats.

Finally, a cgroup_iter program is pinned to a file for each cgroup.
Reading this file invokes the cgroup_iter program to flush the stats and
display them to the user.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
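A minimal userspace sketch of the flush step described above: per-cpu
readings are folded into a per-cgroup total and the same delta is
passed up to the parent, children first. struct model_cgroup,
model_flush_one() and model_flush_all() are hypothetical names, and the
sketch assumes every cgroup is flushed and that a child always has a
higher array index than its parent. The real rstat code only visits
cgroups on the updated tree.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 2

/* Hypothetical per-cgroup state kept by the flusher. */
struct model_cgroup {
	int parent;			/* index of the parent, -1 for root */
	uint64_t percpu[MODEL_NR_CPUS];	/* unflushed per-cpu readings */
	uint64_t pending;		/* deltas propagated from children */
	uint64_t total;			/* flushed hierarchical total */
};

/* Fold per-cpu readings and child deltas into total, then pass them up. */
static void model_flush_one(struct model_cgroup *cgs, int idx)
{
	struct model_cgroup *cg = &cgs[idx];
	uint64_t delta = cg->pending;
	int cpu;

	/* The bottom-up walk below relies on parents having lower indices. */
	assert(cg->parent < idx);

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		delta += cg->percpu[cpu];
		cg->percpu[cpu] = 0;
	}
	cg->pending = 0;
	cg->total += delta;
	if (cg->parent >= 0)
		cgs[cg->parent].pending += delta;
}

/* Flush every cgroup bottom-up: highest index (deepest) first. */
static void model_flush_all(struct model_cgroup *cgs, int nr)
{
	int i;

	for (i = nr - 1; i >= 0; i--)
		model_flush_one(cgs, i);
}
```

Because each delta is consumed when it is flushed, a second flush with
no new readings leaves every total unchanged in this model.
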
 .../test_cgroup_hierarchical_stats.c          | 335 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 211 +++++++++++
 3 files changed, 553 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..7c4d199967d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <unistd.h>
+#include <errno.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <test_progs.h>
+
+#include "cgroup_vmscan.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) ((x) << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+#define CGROUP_ROOT "/sys/fs/cgroup/"
+
+#define RET_IF_ERR(exp, t, f...) ({		\
+	int ___res = (exp);			\
+	if (CHECK(___res, t, f))		\
+		return ___res;			\
+})
+
+struct cgroup_path {
+	const char *name, *path;
+};
+
+#define CGROUP_PATH(p, n) {.name = #n, .path = CGROUP_ROOT#p"/"#n}
+#define CGROUP_ROOT_PATH {.name = "root", .path = CGROUP_ROOT}
+
+static struct cgroup_path cgroup_hierarchy[] = {
+	CGROUP_ROOT_PATH,
+	CGROUP_PATH(, test),
+	CGROUP_PATH(test, child1),
+	CGROUP_PATH(test, child2),
+	CGROUP_PATH(test/child1, child1_1),
+	CGROUP_PATH(test/child1, child1_2),
+	CGROUP_PATH(test/child2, child2_1),
+	CGROUP_PATH(test/child2, child2_2),
+};
+
+#define N_CGROUPS (sizeof(cgroup_hierarchy)/sizeof(struct cgroup_path))
+
+static const int non_leaf_cgroups = 4;
+static __u64 cgroup_ids[N_CGROUPS];
+
+static int duration;
+
+static __u64 cgroup_id_from_path(const char *cgroup_path)
+{
+	struct stat file_stat;
+
+	if (stat(cgroup_path, &file_stat))
+		return -1;
+	return file_stat.st_ino;
+}
+
+int write_to_file(const char *path, const char *buf, size_t size)
+{
+	int fd, len, err = 0;
+
+	fd = open(path, O_WRONLY);
+	if (fd < 0)
+		return -errno;
+	len = write(fd, buf, size);
+	if (len < 0)
+		err = -errno;
+	else if (len < size)
+		err = -1;
+	close(fd);
+	return err;
+}
+
+int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return -errno;
+	len = read(fd, buf, size - 1);
+	if (len >= 0)
+		buf[len] = 0;
+	close(fd);
+	return len < 0 ? -errno : 0;
+}
+
+int setup_hierarchy(void)
+{
+	int i, len;
+	char path[128];
+
+	/* Mount bpffs, and create a directory to pin cgroup_iters in */
+	RET_IF_ERR(mount("bpf", BPFFS_ROOT, "bpf", 0, NULL), "mount",
+		   "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
+		   strerror(errno));
+	RET_IF_ERR(mkdir(BPFFS_VMSCAN, 0755), "mkdir",
+		   "failed to mkdir %s (%s)\n", BPFFS_VMSCAN, strerror(errno));
+
+	/* Mount cgroup v2 */
+	RET_IF_ERR(mount("none", CGROUP_ROOT, "cgroup2", 0, NULL),
+		   "mount", "failed to mount cgroup2 at %s (%s)\n",
+		   CGROUP_ROOT, strerror(errno));
+
+	/* Enable memory controller in cgroup v2 root */
+	snprintf(path, 128, "%scgroup.subtree_control", CGROUP_ROOT);
+	len = strlen("+memory");
+	RET_IF_ERR(write_to_file(path, "+memory", len), "+memory",
+		   "+memory failed in root (%s)\n", strerror(errno));
+	/* Root cgroup id is 1 in v2 */
+	cgroup_ids[0] = 1;
+
+	for (i = 1; i < N_CGROUPS; i++) {
+		/* Create cgroup */
+		RET_IF_ERR(mkdir(cgroup_hierarchy[i].path, 0666),
+			   "mkdir", "failed to mkdir %s (%s)\n",
+			   cgroup_hierarchy[i].path, strerror(errno));
+
+		cgroup_ids[i] = cgroup_id_from_path(cgroup_hierarchy[i].path);
+
+		/* Enable memory controller in non-leaf cgroups */
+		if (i < non_leaf_cgroups) {
+			snprintf(path, 128, "%s/cgroup.subtree_control",
+				 cgroup_hierarchy[i].path);
+			RET_IF_ERR(write_to_file(path, "+memory", strlen("+memory")),
+				   "+memory", "+memory failed in %s (%s)\n",
+				   cgroup_hierarchy[i].name, strerror(errno));
+		}
+	}
+	return 0;
+}
+
+void destroy_hierarchy(void)
+{
+	int i;
+	char path[128];
+
+	for (i = N_CGROUPS - 1; i >= 0; i--) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroup_hierarchy[i].name);
+		CHECK(remove(path), "remove", "failed to remove %s (%s)\n",
+		      path, strerror(errno));
+
+		if (i == 0)
+			break;
+
+		/* Delete cgroup */
+		CHECK(rmdir(cgroup_hierarchy[i].path), "rmdir",
+		      "failed to rmdir %s (%s)\n", cgroup_hierarchy[i].path,
+		      strerror(errno));
+	}
+	/* Remove created directory in bpffs */
+	CHECK(rmdir(BPFFS_VMSCAN), "rmdir", "failed to rmdir %s (%s)\n",
+	      BPFFS_VMSCAN, strerror(errno));
+	/* Unmount bpffs */
+	CHECK(umount(BPFFS_ROOT), "umount", "failed to unmount bpffs (%s)\n",
+	      strerror(errno));
+	/* Unmount cgroup v2 */
+	CHECK(umount(CGROUP_ROOT), "umount", "failed to unmount cgroup2 (%s)\n",
+	      strerror(errno));
+}
+
+void alloc_anon(size_t size)
+{
+	char *buf, *ptr;
+
+	buf = malloc(size);
+	for (ptr = buf; buf && ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 0;
+	free(buf);
+}
+
+int induce_vmscan(void)
+{
+	char cmd[128], path[128];
+	int i, pid, len;
+
+	/*
+	 * Set memory.high for test parent cgroup to 1 MB to throttle
+	 * allocations and invoke reclaim in children.
+	 */
+	snprintf(path, 128, "%s/memory.high", cgroup_hierarchy[1].path);
+	len = snprintf(cmd, 128, "%d", MB(1));
+	RET_IF_ERR(write_to_file(path, cmd, len), "memory.high",
+		   "failed to write to %s (%s)\n", path, strerror(errno));
+
+	/*
+	 * In every leaf cgroup, run a memory hog for a few seconds to induce
+	 * reclaim then kill it.
+	 */
+	for (i = non_leaf_cgroups; i < N_CGROUPS; i++) {
+		pid = fork();
+		if (pid == 0) {
+			pid = getpid();
+
+			/* Add child to leaf cgroup */
+			snprintf(path, 128, "%s/cgroup.procs",
+				 cgroup_hierarchy[i].path);
+			len = snprintf(cmd, 128, "%d", pid);
+			RET_IF_ERR(write_to_file(path, cmd, len),
+				   "cgroup.procs",
+				   "failed to add pid %d to cgroup %s (%s)\n",
+				   pid, cgroup_hierarchy[i].name,
+				   strerror(errno));
+
+			/* Allocate 2 MB */
+			alloc_anon(MB(2));
+			exit(0);
+		} else {
+			/* Wait for child to cause reclaim then kill it */
+			sleep(3);
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+	}
+	return 0;
+}
+
+int check_vmscan_stats(void)
+{
+	char buf[128], path[128];
+	int i;
+	__u64 vmscan_readings[N_CGROUPS];
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		__u64 id;
+
+		/* For every cgroup, read the file generated by cgroup_iter */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			cgroup_hierarchy[i].name);
+		RET_IF_ERR(read_from_file(path, buf, 128), "read",
+			   "failed to read from %s (%s)\n",
+			   path, strerror(errno));
+		/* Check the output file formatting */
+		ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+				 &id, &vmscan_readings[i]), 2, "output format");
+
+		/* Check that the cgroup_id is displayed correctly */
+		ASSERT_EQ(cgroup_ids[i], id, "cgroup_id");
+		/* Check that the vmscan reading is non-zero */
+		ASSERT_NEQ(vmscan_readings[i], 0, "vmscan_reading");
+	}
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[4] + vmscan_readings[5],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[3], vmscan_readings[6] + vmscan_readings[7],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[2] + vmscan_readings[3],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_readings[0], vmscan_readings[1], "root_vmscan");
+
+	return 0;
+}
+
+int setup_progs(struct cgroup_vmscan **skel)
+{
+	int i;
+	struct bpf_link *link;
+	struct cgroup_vmscan *obj;
+
+	obj = cgroup_vmscan__open_and_load();
+	if (!ASSERT_OK_PTR(obj, "open_and_load"))
+		return libbpf_get_error(obj);
+
+	/* Attach rstat flusher to memory subsystem */
+	link = bpf_program__attach_subsys(obj->progs.vmscan_flush, "memory");
+	if (!ASSERT_OK_PTR(link, "attach_subsys"))
+		return libbpf_get_error(link);
+
+	/* Attach a cgroup_iter program per cgroup to dump that cgroup's stats */
+	for (i = 0; i < N_CGROUPS; i++) {
+		DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+		union bpf_iter_link_info linfo = {};
+		char path[128];
+
+		/* Create an iter link, parameterized by cgroup id */
+		linfo.cgroup.cgroup_id = cgroup_ids[i];
+		opts.link_info = &linfo;
+		opts.link_info_len = sizeof(linfo);
+		link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+		if (!ASSERT_OK_PTR(link, "attach_iter"))
+			return libbpf_get_error(link);
+
+		/* Pin the link to a bpffs file */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroup_hierarchy[i].name);
+		ASSERT_OK(bpf_link__pin(link, path), "link_pin");
+	}
+
+	/* Attach tracing programs that will calculate vmscan delays */
+	link = bpf_program__attach(obj->progs.vmscan_start);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return libbpf_get_error(link);
+
+	link = bpf_program__attach(obj->progs.vmscan_end);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return libbpf_get_error(link);
+
+	*skel = obj;
+	return 0;
+}
+
+void destroy_progs(struct cgroup_vmscan *skel)
+{
+	cgroup_vmscan__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_vmscan *skel = NULL;
+
+	if (setup_hierarchy())
+		goto cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..b10ad01e878a 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@ struct bpf_iter__sockmap {
 	struct sock *sk;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_vmscan.c b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
new file mode 100644
index 000000000000..41516f8263b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclaim concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+static inline struct cgroup *task_memcg(struct task_struct *task)
+{
+	return BPF_CORE_READ(task, cgroups, subsys[memory_cgrp_id], cgroup);
+}
+
+static inline __u64 cgroup_id(struct cgroup *cgrp)
+{
+	return BPF_CORE_READ(cgrp, kn, id);
+}
+
+static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				&pcpu_init, BPF_NOEXIST)) {
+		bpf_printk("failed to create pcpu entry for cgroup %llu\n",
+			   cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init = {.state = state, .pending = pending};
+
+	if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				&init, BPF_NOEXIST)) {
+		bpf_printk("failed to create entry for cgroup %llu\n",
+			   cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+SEC("raw_tp/mm_vmscan_memcg_reclaim_begin")
+int vmscan_start(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
+					  BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage\n");
+		return 0;
+	}
+
+	*start_time_ptr = bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("raw_tp/mm_vmscan_memcg_reclaim_end")
+int vmscan_end(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current = bpf_get_current_task_btf();
+	struct cgroup *cgrp = task_memcg(current);
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time = bpf_ktime_get_ns();
+
+	/* cgrp may not have memory controller enabled */
+	if (!cgrp)
+		return 0;
+
+	cg_id = cgroup_id(cgrp);
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      0 /* do not create on end */);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving task local storage\n");
+		return 0;
+	}
+
+	current_elapsed = end_time - *start_time_ptr;
+	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		__sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
+	else
+		create_vmscan_percpu_elem(cg_id, current_elapsed);
+
+	bpf_cgroup_rstat_updated(cgrp);
+	return 0;
+}
+
+SEC("cgroup_subsys/rstat")
+int vmscan_flush(struct bpf_rstat_ctx *ctx)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	__u64 *pcpu_vmscan;
+	__u64 state;
+	__u64 delta = 0;
+	__u64 cg_id = ctx->cgroup_id;
+	__u64 parent_cg_id = ctx->parent_cgroup_id;
+	__s32 cpu = ctx->cpu;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state = pcpu_stat->state;
+		delta += state - pcpu_stat->prev;
+		pcpu_stat->prev = state;
+	}
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		create_vmscan_elem(cg_id, delta, 0);
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta += total_stat->pending;
+		total_stat->pending = 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state += delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending += delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+
+	return 0;
+}
+
+SEC("iter/cgroup")
+int dump_vmscan(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct vmscan *total_stat;
+	__u64 cg_id = cgroup_id(cgroup);
+
+	/* Flush the stats to make sure we get the most up-to-date numbers */
+	bpf_cgroup_rstat_flush(cgroup);
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
+		return 0;
+	}
+	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+		       cg_id, total_stat->state);
+	return 0;
+}
+