From patchwork Tue Feb 1 20:55:30 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12732299
X-Patchwork-Delegate: bpf@iogearbox.net
Date: Tue, 1 Feb 2022 12:55:30 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-2-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
    Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, Hao Luo
X-Patchwork-State: RFC

Introduce a tag structure for directories in bpffs. A tag carries special
information about a directory. For example, a BPF_DIR_KERNFS_REP tag
denotes that a directory is a replica of a kernfs hierarchy.

At mkdir, if the parent directory has a tag, the child directory also gets
a tag. For KERNFS_REP directories, the tag references a kernfs node, and
the KERNFS_REP hierarchy mirrors the hierarchy in kernfs. Userspace is
responsible for keeping the two hierarchies in sync.
The initial tag can be created by pinning a certain type of bpf object.
The following patches will introduce such objects, and the tagged
directories will mirror the cgroup hierarchy. Tags are destroyed at rmdir.

Signed-off-by: Hao Luo
---
 kernel/bpf/inode.c | 80 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/inode.h | 22 +++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/inode.h

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 5a8d9f7467bf..ecc357009df5 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -16,11 +16,13 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include "preload/bpf_preload.h"
+#include "inode.h"

 enum bpf_type {
 	BPF_TYPE_UNSPEC = 0,
@@ -142,6 +144,52 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 	return 0;
 }

+static struct bpf_dir_tag *inode_tag(const struct inode *inode)
+{
+	if (unlikely(!S_ISDIR(inode->i_mode)))
+		return NULL;
+
+	return inode->i_private;
+}
+
+/* tag_dir_inode - tag a newly created directory.
+ * @tag: tag of parent directory
+ * @dentry: dentry of the new directory
+ * @inode: inode of the new directory
+ *
+ * Called from bpf_mkdir.
+ */
+static int tag_dir_inode(const struct bpf_dir_tag *tag,
+			 const struct dentry *dentry, struct inode *inode)
+{
+	struct bpf_dir_tag *t;
+	struct kernfs_node *kn;
+
+	WARN_ON(tag->type != BPF_DIR_KERNFS_REP);
+
+	/* kn is put at tag deallocation. */
+	kn = kernfs_find_and_get_ns(tag->private, dentry->d_name.name, NULL);
+	if (unlikely(!kn))
+		return -ENOENT;
+
+	if (unlikely(kernfs_type(kn) != KERNFS_DIR)) {
+		kernfs_put(kn);
+		return -EPERM;
+	}
+
+	t = kzalloc(sizeof(struct bpf_dir_tag), GFP_KERNEL | __GFP_NOWARN);
+	if (unlikely(!t)) {
+		kernfs_put(kn);
+		return -ENOMEM;
+	}
+
+	t->type = tag->type;
+	t->private = kn;
+
+	inode->i_private = t;
+	return 0;
+}
+
 static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
 				struct inode *dir)
 {
@@ -156,6 +204,8 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 		     struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
+	struct bpf_dir_tag *tag;
+	int err;

 	inode = bpf_get_inode(dir->i_sb, dir, mode | S_IFDIR);
 	if (IS_ERR(inode))
@@ -164,6 +214,15 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	inode->i_op = &bpf_dir_iops;
 	inode->i_fop = &simple_dir_operations;

+	tag = inode_tag(dir);
+	if (tag) {
+		err = tag_dir_inode(tag, dentry, inode);
+		if (err) {
+			iput(inode);
+			return err;
+		}
+	}
+
 	inc_nlink(inode);
 	inc_nlink(dir);

@@ -404,11 +463,30 @@ static int bpf_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	return 0;
 }

+static void untag_dir_inode(struct inode *dir)
+{
+	struct bpf_dir_tag *tag = inode_tag(dir);
+
+	WARN_ON(tag->type != BPF_DIR_KERNFS_REP);
+
+	dir->i_private = NULL;
+	kernfs_put(tag->private);
+	kfree(tag);
+}
+
+static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	if (inode_tag(dir))
+		untag_dir_inode(dir);
+
+	return simple_rmdir(dir, dentry);
+}
+
 static const struct inode_operations bpf_dir_iops = {
 	.lookup		= bpf_lookup,
 	.mkdir		= bpf_mkdir,
 	.symlink	= bpf_symlink,
-	.rmdir		= simple_rmdir,
+	.rmdir		= bpf_rmdir,
 	.rename		= simple_rename,
 	.link		= simple_link,
 	.unlink		= simple_unlink,
diff --git a/kernel/bpf/inode.h b/kernel/bpf/inode.h
new file mode 100644
index 000000000000..2cfeef39e861
--- /dev/null
+++ b/kernel/bpf/inode.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Google
+ */
+#ifndef __BPF_INODE_H_
+#define __BPF_INODE_H_
+
+enum tag_type {
+	/* The directory is a replicate of a kernfs directory hierarchy. */
+	BPF_DIR_KERNFS_REP = 0,
+};
+
+/* A tag for bpffs directories. It carries special information about a
+ * directory. For example, BPF_DIR_KERNFS_REP denotes that the directory is
+ * a replicate of a kernfs hierarchy. Pinning a certain type of objects tags
+ * a directory and the tag will be removed at rmdir.
+ */
+struct bpf_dir_tag {
+	enum tag_type type;
+	void *private; /* tag private data */
+};
+
+#endif
From patchwork Tue Feb 1 20:55:31 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12732300
X-Patchwork-Delegate: bpf@iogearbox.net
Date: Tue, 1 Feb 2022 12:55:31 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-3-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag.
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
    Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, Hao Luo
X-Patchwork-State: RFC

Embed a list of bpf objects in a directory's tag. This list is shared by
all the directories in the tagged hierarchy. When a new tagged directory
is created, it is prepopulated with the objects in the inherit list; when
the directory is removed, the inherited objects are removed automatically.
Because the whole tagged hierarchy shares the same list, every directory
in the hierarchy is prepopulated with the same set of objects.

Signed-off-by: Hao Luo
---
 kernel/bpf/inode.c | 110 +++++++++++++++++++++++++++++++++++++++++----
 kernel/bpf/inode.h |  33 ++++++++++++++
 2 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index ecc357009df5..9ae17a2bf779 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -24,13 +24,6 @@
 #include "preload/bpf_preload.h"
 #include "inode.h"

-enum bpf_type {
-	BPF_TYPE_UNSPEC = 0,
-	BPF_TYPE_PROG,
-	BPF_TYPE_MAP,
-	BPF_TYPE_LINK,
-};
-
 static void *bpf_any_get(void *raw, enum bpf_type type)
 {
 	switch (type) {
@@ -69,6 +62,20 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	}
 }

+static void free_obj_list(struct kref *kref)
+{
+	struct obj_list *list;
+	struct bpf_inherit_entry *e;
+
+	list = container_of(kref, struct obj_list, refcnt);
+	list_for_each_entry(e, &list->list, list) {
+		list_del_rcu(&e->list);
+		bpf_any_put(e->obj, e->type);
+		kfree(e);
+	}
+	kfree(list);
+}
+
 static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 {
 	void *raw;
@@ -100,6 +107,10 @@ static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops = { };

+static int bpf_mkprog(struct dentry *dentry, umode_t mode, void *arg);
+static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg);
+static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg);
+
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
 				   umode_t mode)
@@ -184,12 +195,62 @@ static int tag_dir_inode(const struct bpf_dir_tag *tag,
 	}

 	t->type = tag->type;
+	t->inherit_objects = tag->inherit_objects;
+	kref_get(&t->inherit_objects->refcnt);
 	t->private = kn;

 	inode->i_private = t;
 	return 0;
 }

+/* populate_dir - populate directory with bpf objects in a tag's
+ * inherit_objects.
+ * @dir: dentry of the directory.
+ * @inode: inode of the directory.
+ *
+ * Called from mkdir. Must be called after dentry has been finalized.
+ */
+static int populate_dir(struct dentry *dir, struct inode *inode)
+{
+	struct bpf_dir_tag *tag = inode_tag(inode);
+	struct bpf_inherit_entry *e;
+	struct dentry *child;
+	int ret;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &tag->inherit_objects->list, list) {
+		child = lookup_one_len_unlocked(e->name.name, dir,
+						strlen(e->name.name));
+		if (unlikely(IS_ERR(child))) {
+			ret = PTR_ERR(child);
+			break;
+		}
+
+		switch (e->type) {
+		case BPF_TYPE_PROG:
+			ret = bpf_mkprog(child, e->mode, e->obj);
+			break;
+		case BPF_TYPE_MAP:
+			ret = bpf_mkmap(child, e->mode, e->obj);
+			break;
+		case BPF_TYPE_LINK:
+			ret = bpf_mklink(child, e->mode, e->obj);
+			break;
+		default:
+			ret = -EPERM;
+			break;
+		}
+		dput(child);
+		if (ret)
+			break;
+
+		/* To match bpf_any_put in bpf_free_inode. */
+		bpf_any_get(e->obj, e->type);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
 				struct inode *dir)
 {
@@ -227,6 +288,12 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	inc_nlink(dir);

 	bpf_dentry_finalize(dentry, inode, dir);
+
+	if (tag) {
+		err = populate_dir(dentry, inode);
+		if (err)
+			return err;
+	}
 	return 0;
 }

@@ -463,6 +530,30 @@ static int bpf_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	return 0;
 }

+/* unpopulate_dir - remove pre-populated entries from directory.
+ * @dentry: dentry of directory
+ * @inode: inode of directory
+ *
+ * Called from rmdir.
+ */
+static void unpopulate_dir(struct dentry *dentry, struct inode *inode)
+{
+	struct bpf_dir_tag *tag = inode_tag(inode);
+	struct bpf_inherit_entry *e;
+	struct dentry *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &tag->inherit_objects->list, list) {
+		child = d_hash_and_lookup(dentry, &e->name);
+		if (unlikely(IS_ERR(child)))
+			continue;
+
+		simple_unlink(inode, child);
+		dput(child);
+	}
+	rcu_read_unlock();
+}
+
 static void untag_dir_inode(struct inode *dir)
 {
 	struct bpf_dir_tag *tag = inode_tag(dir);
@@ -471,13 +562,16 @@ static void untag_dir_inode(struct inode *dir)

 	dir->i_private = NULL;
 	kernfs_put(tag->private);
+	kref_put(&tag->inherit_objects->refcnt, free_obj_list);
 	kfree(tag);
 }

 static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	if (inode_tag(dir))
+	if (inode_tag(dir)) {
+		unpopulate_dir(dentry, dir);
 		untag_dir_inode(dir);
+	}

 	return simple_rmdir(dir, dentry);
 }
diff --git a/kernel/bpf/inode.h b/kernel/bpf/inode.h
index 2cfeef39e861..a8207122643d 100644
--- a/kernel/bpf/inode.h
+++ b/kernel/bpf/inode.h
@@ -4,11 +4,42 @@
 #ifndef __BPF_INODE_H_
 #define __BPF_INODE_H_

+#include
+#include
+
+enum bpf_type {
+	BPF_TYPE_UNSPEC = 0,
+	BPF_TYPE_PROG,
+	BPF_TYPE_MAP,
+	BPF_TYPE_LINK,
+};
+
 enum tag_type {
 	/* The directory is a replicate of a kernfs directory hierarchy. */
 	BPF_DIR_KERNFS_REP = 0,
 };

+/* Entry for bpf_dir_tag->inherit_objects.
+ *
+ * When a new directory is created from a tagged directory, the new directory
+ * will be populated with bpf objects in the tag's inherit_objects list. Each
+ * entry holds a reference of a bpf object and the information needed to
+ * recreate the object's entry in the new directory.
+ */
+struct bpf_inherit_entry {
+	struct list_head list;
+	void *obj;		/* bpf object to inherit. */
+	enum bpf_type type;	/* type of the object (prog, map or link). */
+	struct qstr name;	/* name of the entry. */
+	umode_t mode;		/* access mode of the entry. */
+};
+
+struct obj_list {
+	struct list_head list;
+	struct kref refcnt;
+	struct inode *root;
+};
+
 /* A tag for bpffs directories. It carries special information about a
  * directory. For example, BPF_DIR_KERNFS_REP denotes that the directory is
  * a replicate of a kernfs hierarchy. Pinning a certain type of objects tags
@@ -16,6 +47,8 @@ enum tag_type {
  */
 struct bpf_dir_tag {
 	enum tag_type type;
+	/* list of bpf objects that a directory inherits from its parent. */
+	struct obj_list *inherit_objects;
 	void *private; /* tag private data */
 };
From patchwork Tue Feb 1 20:55:32 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12732302
X-Patchwork-Delegate: bpf@iogearbox.net
Date: Tue, 1 Feb 2022 12:55:32 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-4-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
    Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, Hao Luo
X-Patchwork-State: RFC

Introduce a new type of iter prog: 'cgroup_view'. It prints out a cgroup's
state. cgroup_view is meant to be used together with directory tagging:
when a cgroup_view prog is pinned in a directory, it tags that directory
as KERNFS_REP, i.e. a replica of the cgroup hierarchy.
Whenever a subdirectory is created, if a child cgroup of the same name
exists, the subdirectory inherits the pinned cgroup_view object from its
parent and holds a reference on the corresponding kernfs node.

The cgroup_view prog takes a pointer to the cgroup and can use the family
of seq_print helpers to print out cgroup state. A typical use case of
cgroup_view is extending the cgroupfs interface.

Signed-off-by: Hao Luo
---
 include/linux/bpf.h           |   2 +
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/bpf_iter.c         |  11 ++++
 kernel/bpf/cgroup_view_iter.c | 114 ++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_view_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6eb0b180d33b..494927b2b3c2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1610,6 +1610,7 @@ typedef const struct bpf_func_proto *
 enum bpf_iter_feature {
 	BPF_ITER_RESCHED	= BIT(0),
+	BPF_ITER_INHERIT	= BIT(1),
 };

 #define BPF_ITER_CTX_ARG_MAX 2
@@ -1647,6 +1648,7 @@ bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, struct bpf_prog *prog);
 int bpf_iter_new_fd(struct bpf_link *link);
 bool bpf_link_is_iter(struct bpf_link *link);
+bool bpf_link_support_inherit(struct bpf_link *link);
 struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop);
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
 void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux,
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..d9d2b8541ba7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_view_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 110029ede71e..ff5577a5f73a 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -496,6 +496,17 @@ bool bpf_link_is_iter(struct bpf_link *link)
 	return link->ops == &bpf_iter_link_lops;
 }

+bool bpf_link_support_inherit(struct bpf_link *link)
+{
+	struct bpf_iter_link *iter_link;
+
+	if (!bpf_link_is_iter(link))
+		return false;
+
+	iter_link = container_of(link, struct bpf_iter_link, link);
+	return iter_link->tinfo->reg_info->feature & BPF_ITER_INHERIT;
+}
+
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr,
 			 struct bpf_prog *prog)
 {
diff --git a/kernel/bpf/cgroup_view_iter.c b/kernel/bpf/cgroup_view_iter.c
new file mode 100644
index 000000000000..a44d115235c4
--- /dev/null
+++ b/kernel/bpf/cgroup_view_iter.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "inode.h"
+
+static void *cgroup_view_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_dir_tag *tag;
+	struct kernfs_node *kn;
+	struct cgroup *cgroup;
+	struct inode *dir;
+
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	dir = d_inode(seq->file->f_path.dentry->d_parent);
+	tag = dir->i_private;
+	if (!tag)
+		return NULL;
+
+	kn = tag->private;
+
+	rcu_read_lock();
+	cgroup = rcu_dereference(*(void __rcu __force **)&kn->priv);
+	if (!cgroup || !cgroup_tryget(cgroup))
+		cgroup = NULL;
+	rcu_read_unlock();
+
+	if (!cgroup)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return cgroup;
+}
+
+static void *cgroup_view_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+struct bpf_iter__cgroup_view {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+DEFINE_BPF_ITER_FUNC(cgroup_view, struct bpf_iter_meta *meta, struct cgroup *cgroup)
+
+static int cgroup_view_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup_view ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_view_seq_stop(struct seq_file *seq, void *v)
+{
+	if (v)
+		cgroup_put(v);
+}
+
+static const struct seq_operations cgroup_view_seq_ops = {
+	.start	= cgroup_view_seq_start,
+	.next	= cgroup_view_seq_next,
+	.stop	= cgroup_view_seq_stop,
+	.show	= cgroup_view_seq_show,
+};
+
+BTF_ID_LIST(btf_cgroup_id)
+BTF_ID(struct, cgroup)
+
+static const struct bpf_iter_seq_info cgroup_view_seq_info = {
+	.seq_ops		= &cgroup_view_seq_ops,
+	.init_seq_private	= NULL,
+	.fini_seq_private	= NULL,
+	.seq_priv_size		= 0,
+};
+
+static struct bpf_iter_reg cgroup_view_reg_info = {
+	.target			= "cgroup_view",
+	.feature		= BPF_ITER_INHERIT,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup_view, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_view_seq_info,
+};
+
+static int __init cgroup_view_init(void)
+{
+	cgroup_view_reg_info.ctx_arg_info[0].btf_id = *btf_cgroup_id;
+	return bpf_iter_reg_target(&cgroup_view_reg_info);
+}
+
+late_initcall(cgroup_view_init);
From patchwork Tue Feb 1 20:55:33 2022
X-Patchwork-Submitter: Hao Luo
X-Patchwork-Id: 12732301
X-Patchwork-Delegate: bpf@iogearbox.net
Date: Tue, 1 Feb 2022 12:55:33 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-5-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
    Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, Hao Luo
X-Patchwork-State: RFC

When a cgroup_view iter is pinned into bpffs, the pinning adds a tag to
the parent directory, indicating that the directory hierarchy underneath
is a replica of the cgroup hierarchy. Each directory connects to a cgroup
directory.
Whenever a subdirectory is created, if there is a subcgroup of the same name exists, the subdirectory will be populated with entries holding a list of bpf objects registered in the tag's inherit list. The inherit list is formed by the objects pinned in the top level tagged directory. For example, bpf_obj_pin(cgroup_view_link, "/sys/fs/bpf/A/obj"); pins a link in A. A becomes tagged and the link object is registered in the inherit list of A's tag. mkdir("/sys/fs/bpf/A/B"); When A/B is created, B inherits the pinned objects in A. B is populated with objects. > ls /sys/fs/bpf/A/B obj Currently, only pinning cgroup_view link can tag a directory. And once tagged, only rmdir can remove the tag. Signed-off-by: Hao Luo --- kernel/bpf/inode.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 84 insertions(+) diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 9ae17a2bf779..b71840bf979d 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -71,6 +71,7 @@ static void free_obj_list(struct kref *kref) list_for_each_entry(e, &list->list, list) { list_del_rcu(&e->list); bpf_any_put(e->obj, e->type); + kfree(e->name.name); kfree(e); } kfree(list); @@ -486,9 +487,20 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg) &bpffs_map_fops : &bpffs_obj_fops); } +static int +bpf_inherit_object(struct dentry *dentry, umode_t mode, void *obj, + enum bpf_type type); + static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg) { struct bpf_link *link = arg; + int err; + + if (bpf_link_support_inherit(link)) { + err = bpf_inherit_object(dentry, mode, link, BPF_TYPE_LINK); + if (err) + return err; + } return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops, bpf_link_is_iter(link) ? 
@@ -586,6 +598,78 @@ static const struct inode_operations bpf_dir_iops = {
 	.unlink		= simple_unlink,
 };
 
+/* bpf_inherit_object - register an object in a directory tag's inherit list
+ * @dentry: dentry of the location to pin
+ * @mode: mode of created file entry
+ * @obj: bpf object
+ * @type: type of bpf object
+ *
+ * Could be called from bpf_obj_do_pin() or from mkdir().
+ */
+static int bpf_inherit_object(struct dentry *dentry, umode_t mode,
+			      void *obj, enum bpf_type type)
+{
+	struct inode *dir = d_inode(dentry->d_parent);
+	struct obj_list *inherits;
+	struct bpf_inherit_entry *e;
+	struct bpf_dir_tag *tag;
+	const char *name;
+	bool queued = false, new_tag = false;
+
+	/* allocate bpf_dir_tag */
+	tag = inode_tag(dir);
+	if (!tag) {
+		new_tag = true;
+		tag = kzalloc(sizeof(struct bpf_dir_tag), GFP_KERNEL);
+		if (unlikely(!tag))
+			return -ENOMEM;
+
+		tag->type = BPF_DIR_KERNFS_REP;
+		inherits = kzalloc(sizeof(struct obj_list), GFP_KERNEL);
+		if (unlikely(!inherits)) {
+			kfree(tag);
+			return -ENOMEM;
+		}
+
+		kref_init(&inherits->refcnt);
+		INIT_LIST_HEAD(&inherits->list);
+		tag->inherit_objects = inherits;
+		/* initial tag points to the default root cgroup. */
+		tag->private = cgrp_dfl_root.kf_root->kn;
+		dir->i_private = tag;
+	} else {
+		inherits = tag->inherit_objects;
+	}
+
+	list_for_each_entry_rcu(e, &inherits->list, list) {
+		if (!strcmp(dentry->d_name.name, e->name.name)) {
+			queued = true;
+			break;
+		}
+	}
+
+	/* queue in tag's inherit_list.
+	 */
+	if (!queued) {
+		e = kzalloc(sizeof(struct bpf_inherit_entry), GFP_KERNEL);
+		if (!e) {
+			if (new_tag) {
+				kfree(tag);
+				kfree(inherits);
+			}
+			return -ENOMEM;
+		}
+
+		INIT_LIST_HEAD(&e->list);
+		e->mode = mode;
+		e->obj = obj;
+		e->type = type;
+		name = kstrdup(dentry->d_name.name, GFP_USER | __GFP_NOWARN);
+		e->name = (struct qstr)QSTR_INIT(name, strlen(name));
+		list_add_rcu(&e->list, &inherits->list);
+	}
+	return 0;
+}
+
 /* pin iterator link into bpffs */
 static int bpf_iter_link_pin_kernel(struct dentry *parent,
				     const char *name, struct bpf_link *link)

From patchwork Tue Feb 1 20:55:34 2022
Date: Tue, 1 Feb 2022 12:55:34 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-6-haoluo@google.com>
References: <20220201205534.1962784-1-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
From: Hao Luo
A selftest for demo/testing cgroup_view pinning, directory tagging and
cat'ing cgroup_view pinned objects. The added cgroup_view prog is a naive
use case, which merely reads the cgroup's id.

This selftest introduces two metrics for measuring cgroup-level scheduling
queueing delays:

- queue_self: queueing delays caused by waiting behind tasks of the same
  cgroup.
- queue_other: queueing delays caused by waiting behind tasks of other
  cgroups.

The selftest collects task-level queueing delays and breaks them into
queue_self delays and queue_other delays. This is done by hooking handlers
at context switch and wakeup. Only the max queue_self and queue_other
delays are recorded. This breakdown is helpful in analyzing the cause of
long scheduling delays: a large value in queue_self indicates contention
that comes from within the cgroup, while a large value in queue_other
indicates contention between cgroups.

A new prog, "dump_cgroup_lat", of the new iter type cgroup_view reads the
recorded delays and dumps the stats through bpffs interfaces.
Specifically, the cgroup_view link is initially pinned in an empty
directory in bpffs, which effectively turns that directory into a mirror
of the cgroup hierarchy. When a new cgroup is created, we manually create
a corresponding directory in bpffs. The new directory is pre-populated
with a file for the pinned cgroup_view object. Reading that file yields
the queue stats of the corresponding cgroup.
Signed-off-by: Hao Luo
---
 .../selftests/bpf/prog_tests/pinning_cgroup.c | 143 +++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../bpf/progs/bpf_iter_cgroup_view.c          | 232 ++++++++++++++++++
 3 files changed, 382 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c

diff --git a/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
new file mode 100644
index 000000000000..ebef154e63c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+#include "bpf_iter_cgroup_view.skel.h"
+
+static void spin_on_cpu(int seconds)
+{
+	time_t start, now;
+
+	start = time(NULL);
+	do {
+		now = time(NULL);
+	} while (now - start < seconds);
+}
+
+static void do_work(const char *cgroup)
+{
+	int i, cpu = 0, pid;
+	char cmd[128];
+
+	/* make cgroup threaded */
+	snprintf(cmd, 128, "echo threaded > %s/cgroup.type", cgroup);
+	system(cmd);
+
+	/* try to enable the cpu controller. this may fail if the cpu
+	 * controller is not available in cgroup.controllers or a cgroup v1
+	 * hierarchy is already mounted in the system.
+	 */
+	snprintf(cmd, 128, "echo \"+cpu\" > %s/cgroup.subtree_control", cgroup);
+	system(cmd);
+
+	/* launch two children, both running in the child cgroup */
+	for (i = 0; i < 2; ++i) {
+		pid = fork();
+		if (pid == 0) {
+			/* attach to cgroup */
+			snprintf(cmd, 128, "echo %d > %s/cgroup.procs", getpid(), cgroup);
+			system(cmd);
+
+			/* pin process to target cpu */
+			snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+			system(cmd);
+
+			spin_on_cpu(3); /* spin on cpu for 3 seconds */
+			exit(0);
+		}
+	}
+
+	/* pin process to target cpu */
+	snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+	system(cmd);
+
+	spin_on_cpu(3); /* spin on cpu for 3 seconds */
+	wait(NULL);
+}
+
+static void check_pinning(const char *rootpath)
+{
+	const char *child_cgroup = "/sys/fs/cgroup/child";
+	struct bpf_iter_cgroup_view *skel;
+	struct bpf_link *link;
+	struct stat statbuf = {};
+	FILE *file;
+	unsigned long queue_self, queue_other;
+	int cgroup_id, link_fd;
+	char path[64];
+	char buf[64];
+
+	skel = bpf_iter_cgroup_view__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_cgroup_view__open_and_load"))
+		return;
+
+	/* pin path at parent dir.
+	 */
+	link = bpf_program__attach_iter(skel->progs.dump_cgroup_lat, NULL);
+	link_fd = bpf_link__fd(link);
+
+	/* test initial pinning */
+	snprintf(path, 64, "%s/obj", rootpath);
+	ASSERT_OK(bpf_obj_pin(link_fd, path), "bpf_obj_pin");
+	ASSERT_OK(stat(path, &statbuf), "pinned_object_exists");
+
+	/* test mkdir */
+	mkdir(child_cgroup, 0755);
+	snprintf(path, 64, "%s/child", rootpath);
+	ASSERT_OK(mkdir(path, 0755), "mkdir");
+
+	/* test that the new dir has been pre-populated with pinned objects */
+	snprintf(path, 64, "%s/child/obj", rootpath);
+	ASSERT_OK(stat(path, &statbuf), "populate");
+
+	bpf_iter_cgroup_view__attach(skel);
+	do_work(child_cgroup);
+	bpf_iter_cgroup_view__detach(skel);
+
+	/* test cat'ing inherited objects */
+	file = fopen(path, "r");
+	if (ASSERT_OK_PTR(file, "open")) {
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "cgroup_id: %8d", &cgroup_id), 1, "output");
+
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "queue_self: %8lu", &queue_self), 1, "output");
+
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "queue_other: %8lu", &queue_other), 1, "output");
+
+		fclose(file);
+	}
+
+	/* test rmdir */
+	snprintf(path, 64, "%s/child", rootpath);
+	ASSERT_OK(rmdir(path), "rmdir");
+
+	/* unpin object */
+	snprintf(path, 64, "%s/obj", rootpath);
+	ASSERT_OK(unlink(path), "unlink");
+
+	bpf_link__destroy(link);
+	bpf_iter_cgroup_view__destroy(skel);
+}
+
+void test_pinning_cgroup(void)
+{
+	char tmpl[] = "/sys/fs/bpf/pinning_test_XXXXXX";
+	char *rootpath;
+
+	system("mount -t cgroup2 none /sys/fs/cgroup");
+	system("mount -t bpf bpffs /sys/fs/bpf");
+
+	rootpath = mkdtemp(tmpl);
+	chmod(rootpath, 0755);
+
+	/* check pinning map, prog and link in kernfs */
+	if (test__start_subtest("pinning"))
+		check_pinning(rootpath);
+
+	rmdir(rootpath);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..506bb3efd9b4 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup_view bpf_iter__cgroup_view___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup_view
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@ struct bpf_iter__sockmap {
 	struct sock *sk;
 };
 
+struct bpf_iter__cgroup_view {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c
new file mode 100644
index 000000000000..43404c21aee3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define TASK_RUNNING		0
+#define BPF_F_CURRENT_CPU	0xffffffffULL
+
+extern void fair_sched_class __ksym;
+extern bool CONFIG_FAIR_GROUP_SCHED __kconfig;
+extern bool CONFIG_CGROUP_SCHED __kconfig;
+
+struct wait_lat {
+	/* Queue_self stands for the latency a task experiences while waiting
+	 * behind the tasks that are from the same cgroup.
+	 *
+	 * Queue_other stands for the latency a task experiences while waiting
+	 * behind the tasks that are from other cgroups.
+	 *
+	 * For example, if there are three tasks: A, B and C.
+	 * Suppose A and B are in the same cgroup and C is in another cgroup,
+	 * and we see A has a queueing latency of X milliseconds. Say that
+	 * during those X milliseconds, B has run for Y milliseconds. We can
+	 * break X down into two parts: the time when B is on cpu, that is Y;
+	 * and the time when C is on cpu, that is X - Y.
+	 *
+	 * Queue_self is the former (Y) while queue_other is the latter (X - Y).
+	 *
+	 * A large value in queue_self is an indication of contention within a
+	 * cgroup, while a large value in queue_other is an indication of
+	 * contention between cgroups.
+	 */
+	u64 queue_self;
+	u64 queue_other;
+};
+
+struct timestamp {
+	/* timestamp when last queued */
+	u64 tsp;
+
+	/* cgroup exec_clock when last queued */
+	u64 exec_clock;
+};
+
+/* Map to store per-cgroup wait latency */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, u64);
+	__type(value, struct wait_lat);
+	__uint(max_entries, 65532);
+} cgroup_lat SEC(".maps");
+
+/* Map to store per-task queue timestamp */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct timestamp);
+} start SEC(".maps");
+
+/* adapted from task_cfs_rq() in kernel/sched/sched.h */
+__always_inline
+struct cfs_rq *task_cfs_rq(struct task_struct *t)
+{
+	if (!CONFIG_FAIR_GROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(&t->se, cfs_rq);
+}
+
+/* record enqueue timestamp */
+__always_inline
+static int trace_enqueue(struct task_struct *t)
+{
+	u32 pid = t->pid;
+	struct timestamp *ptr;
+	struct cfs_rq *cfs_rq;
+
+	if (!pid)
+		return 0;
+
+	/* only measure for CFS tasks */
+	if (t->sched_class != &fair_sched_class)
+		return 0;
+
+	ptr = bpf_task_storage_get(&start, t, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(t);
+	if (!cfs_rq)
+		return 0;
+
+	ptr->tsp = bpf_ktime_get_ns();
+	ptr->exec_clock = BPF_CORE_READ(cfs_rq, exec_clock);
+	return 0;
+}
+
+SEC("tp_btf/sched_wakeup")
+int handle__sched_wakeup(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+SEC("tp_btf/sched_wakeup_new")
+int handle__sched_wakeup_new(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+/* task_group() from kernel/sched/sched.h */
+__always_inline
+struct task_group *task_group(struct task_struct *p)
+{
+	if (!CONFIG_CGROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(p, sched_task_group);
+}
+
+__always_inline
+struct cgroup *task_cgroup(struct task_struct *p)
+{
+	struct task_group *tg;
+
+	tg = task_group(p);
+	if (!tg)
+		return NULL;
+
+	return BPF_CORE_READ(tg, css).cgroup;
+}
+
+__always_inline
+u64 max(u64 x, u64 y)
+{
+	return x > y ? x : y;
+}
+
+SEC("tp_btf/sched_switch")
+int handle__sched_switch(u64 *ctx)
+{
+	/* TP_PROTO(bool preempt, struct task_struct *prev,
+	 *	    struct task_struct *next)
+	 */
+	struct task_struct *prev = (struct task_struct *)ctx[1];
+	struct task_struct *next = (struct task_struct *)ctx[2];
+	u64 delta, delta_self, delta_other, id;
+	struct cfs_rq *cfs_rq;
+	struct timestamp *tsp;
+	struct wait_lat *lat;
+	struct cgroup *cgroup;
+
+	/* ivcsw: treat like an enqueue event and store timestamp */
+	if (prev->__state == TASK_RUNNING)
+		trace_enqueue(prev);
+
+	/* only measure for CFS tasks */
+	if (next->sched_class != &fair_sched_class)
+		return 0;
+
+	/* fetch timestamp and calculate delta */
+	tsp = bpf_task_storage_get(&start, next, 0, 0);
+	if (!tsp)
+		return 0;	/* missed enqueue */
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(next);
+	if (!cfs_rq)
+		return 0;
+
+	/* cpu controller may not be enabled */
+	cgroup = task_cgroup(next);
+	if (!cgroup)
+		return 0;
+
+	/* calculate self delay and other delay */
+	delta = bpf_ktime_get_ns() - tsp->tsp;
+	delta_self = BPF_CORE_READ(cfs_rq, exec_clock) - tsp->exec_clock;
+
+	if (delta_self > delta)
+		delta_self = delta;
+	delta_other = delta - delta_self;
+
+	/* insert into cgroup_lat map */
+	id = BPF_CORE_READ(cgroup, kn, id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (!lat) {
+		struct wait_lat w = {
+			.queue_self = delta_self,
+			.queue_other = delta_other,
+		};
+
+		bpf_map_update_elem(&cgroup_lat, &id, &w, BPF_ANY);
+	} else {
+		lat->queue_self = max(delta_self, lat->queue_self);
+		lat->queue_other = max(delta_other, lat->queue_other);
+	}
+
+	bpf_task_storage_delete(&start, next);
+	return 0;
+}
+
+SEC("iter/cgroup_view")
+int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct wait_lat *lat;
+	u64 id;
+
+	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
+	/* look up this cgroup's stats by its kernfs id */
+	id = cgroup->kn->id;
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (lat) {
+		BPF_SEQ_PRINTF(seq, "queue_self: %8lu\n", lat->queue_self);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8lu\n", lat->queue_other);
+	} else {
+		/* print anyway for universal parsing logic in userspace. */
+		BPF_SEQ_PRINTF(seq, "queue_self: %8d\n", 0);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8d\n", 0);
+	}
+	return 0;
+}