From patchwork Thu Jul 28 23:52:35 2022
X-Patchwork-Submitter: Beau Belgrave
X-Patchwork-Id: 12931823
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 1/7] tracing/user_events: Remove BROKEN and restore user_events.h location
Date: Thu, 28 Jul 2022 16:52:35 -0700
Message-Id: <20220728235241.2249-2-beaub@linux.microsoft.com>
In-Reply-To: <20220728235241.2249-1-beaub@linux.microsoft.com>
References: <20220728235241.2249-1-beaub@linux.microsoft.com>
X-Mailing-List: linux-trace-devel@vger.kernel.org

After discussions and after addressing the ABI issues, user_events can now
be marked as working and used by others.

As part of the BROKEN status, user_events.h was moved from its original
uapi location to the kernel location. This needs to be moved back so it
can be used by others.

Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Link: https://lore.kernel.org/all/1651771383.54437.1652370439159.JavaMail.zimbra@efficios.com/
Signed-off-by: Beau Belgrave
---
 include/{ => uapi}/linux/user_events.h | 0
 kernel/trace/Kconfig                   | 1 -
 kernel/trace/trace_events_user.c       | 5 -----
 3 files changed, 6 deletions(-)
 rename include/{ => uapi}/linux/user_events.h (100%)

diff --git a/include/linux/user_events.h b/include/uapi/linux/user_events.h
similarity index 100%
rename from include/linux/user_events.h
rename to include/uapi/linux/user_events.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ccd6a5ade3e9..c9302f46a317 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -770,7 +770,6 @@ config USER_EVENTS
 	bool "User trace events"
 	select TRACING
 	select DYNAMIC_EVENTS
-	depends on BROKEN || COMPILE_TEST # API needs to be straighten out
 	help
 	  User trace events are user-defined trace events that
 	  can be used like an existing kernel trace event.
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 2c0a6ec75548..fd8ea555437a 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -19,12 +19,7 @@
 #include
 #include
 #include
-/* Reminder to move to uapi when everything works */
-#ifdef CONFIG_COMPILE_TEST
-#include
-#else
 #include
-#endif
 #include "trace.h"
 #include "trace_dynevent.h"

From patchwork Thu Jul 28 23:52:36 2022
X-Patchwork-Submitter: Beau Belgrave
X-Patchwork-Id: 12931827
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 2/7] tracing: Add namespace instance directory to tracefs
Date: Thu, 28 Jul 2022 16:52:36 -0700
Message-Id: <20220728235241.2249-3-beaub@linux.microsoft.com>

Some tracing systems, such as user_events, require group or namespace
isolation. The namespace directory in tracefs is a singleton, like the
instances directory. It also acts like the instances directory in that
user-mode processes can create a directory within it if they have the
appropriate permissions.

This change only covers adding the ability for a tracing system to create
the namespace directory. A system for adding and managing namespaces will
reside within another tracing API.
Link: https://lore.kernel.org/all/20220312010140.1880-1-beaub@linux.microsoft.com/
Signed-off-by: Beau Belgrave
---
 fs/tracefs/inode.c      | 119 ++++++++++++++++++++++++++++++++++++++--
 include/linux/tracefs.h |   5 ++
 2 files changed, 118 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 81d26abf486f..7bf95cc65d78 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -24,6 +24,11 @@
 #define TRACEFS_DEFAULT_MODE	0700

+enum tracefs_dir_type {
+	TRACEFS_DIR_INSTANCES,
+	TRACEFS_DIR_NAMESPACES,
+};
+
 static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
@@ -50,6 +55,8 @@ static const struct file_operations tracefs_file_operations = {
 static struct tracefs_dir_ops {
 	int (*mkdir)(const char *name);
 	int (*rmdir)(const char *name);
+	int (*ns_mkdir)(const char *name);
+	int (*ns_rmdir)(const char *name);
 } tracefs_ops __ro_after_init;

 static char *get_dname(struct dentry *dentry)
@@ -67,9 +74,8 @@ static char *get_dname(struct dentry *dentry)
 	return name;
 }

-static int tracefs_syscall_mkdir(struct user_namespace *mnt_userns,
-				 struct inode *inode, struct dentry *dentry,
-				 umode_t mode)
+static int tracefs_syscall_mkdir_core(int type, struct inode *inode,
+				      struct dentry *dentry)
 {
 	char *name;
 	int ret;
@@ -84,7 +90,22 @@ static int tracefs_syscall_mkdir(struct user_namespace *mnt_userns,
 	 * mkdir routine to handle races.
 	 */
 	inode_unlock(inode);
-	ret = tracefs_ops.mkdir(name);
+
+	switch (type) {
+	case TRACEFS_DIR_INSTANCES:
+		ret = tracefs_ops.mkdir(name);
+		break;
+
+	case TRACEFS_DIR_NAMESPACES:
+		ret = tracefs_ops.ns_mkdir(name);
+		break;
+
+	default:
+		pr_debug("tracefs: unknown mkdir type '%d'\n", type);
+		ret = -ENOENT;
+		break;
+	}
+
 	inode_lock(inode);

 	kfree(name);
@@ -92,7 +113,24 @@ static int tracefs_syscall_mkdir(struct user_namespace *mnt_userns,
 	return ret;
 }

-static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
+static int tracefs_syscall_mkdir(struct user_namespace *mnt_userns,
+				 struct inode *inode, struct dentry *dentry,
+				 umode_t mode)
+{
+	return tracefs_syscall_mkdir_core(TRACEFS_DIR_INSTANCES,
+					  inode, dentry);
+}
+
+static int tracefs_syscall_ns_mkdir(struct user_namespace *mnt_userns,
+				    struct inode *inode, struct dentry *dentry,
+				    umode_t mode)
+{
+	return tracefs_syscall_mkdir_core(TRACEFS_DIR_NAMESPACES,
+					  inode, dentry);
+}
+
+static int tracefs_syscall_rmdir_core(int type, struct inode *inode,
+				      struct dentry *dentry)
 {
 	char *name;
 	int ret;
@@ -111,7 +149,20 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
 	inode_unlock(inode);
 	inode_unlock(d_inode(dentry));

-	ret = tracefs_ops.rmdir(name);
+	switch (type) {
+	case TRACEFS_DIR_INSTANCES:
+		ret = tracefs_ops.rmdir(name);
+		break;
+
+	case TRACEFS_DIR_NAMESPACES:
+		ret = tracefs_ops.ns_rmdir(name);
+		break;
+
+	default:
+		pr_debug("tracefs: unknown rmdir type '%d'\n", type);
+		ret = -ENOENT;
+		break;
+	}

 	inode_lock_nested(inode, I_MUTEX_PARENT);
 	inode_lock(d_inode(dentry));
@@ -121,12 +172,30 @@
 	return ret;
 }

+static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry)
+{
+	return tracefs_syscall_rmdir_core(TRACEFS_DIR_INSTANCES,
+					  inode, dentry);
+}
+
+static int tracefs_syscall_ns_rmdir(struct inode *inode, struct dentry *dentry)
+{
+	return tracefs_syscall_rmdir_core(TRACEFS_DIR_NAMESPACES,
+					  inode, dentry);
+}
+
 static const struct inode_operations tracefs_dir_inode_operations = {
 	.lookup = simple_lookup,
 	.mkdir = tracefs_syscall_mkdir,
 	.rmdir = tracefs_syscall_rmdir,
 };

+static const struct inode_operations tracefs_dir_inode_ns_operations = {
+	.lookup = simple_lookup,
+	.mkdir = tracefs_syscall_ns_mkdir,
+	.rmdir = tracefs_syscall_ns_rmdir,
+};
+
 static struct inode *tracefs_get_inode(struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
@@ -582,6 +651,44 @@ __init struct dentry *tracefs_create_instance_dir(const char *name,
 	return dentry;
 }

+/**
+ * tracefs_create_namespace_dir - create the tracing namespaces directory
+ * @name: The name of the namespaces directory to create
+ * @parent: The parent directory that the namespaces directory will exist
+ * @mkdir: The function to call when a mkdir is performed.
+ * @rmdir: The function to call when a rmdir is performed.
+ *
+ * Only one namespaces directory is allowed.
+ *
+ * The namespaces directory is special as it allows for mkdir and rmdir
+ * to be done by userspace. When a mkdir or rmdir is performed, the inode
+ * locks are released and the methods passed in (@mkdir and @rmdir) are
+ * called without locks and with the name of the directory being created
+ * within the namespaces directory.
+ *
+ * Returns the dentry of the namespaces directory.
+ */
+__init struct dentry *tracefs_create_namespace_dir(const char *name,
+					struct dentry *parent,
+					int (*mkdir)(const char *name),
+					int (*rmdir)(const char *name))
+{
+	struct dentry *dentry;
+
+	/* Only allow one instance of the namespaces directory. */
+	if (WARN_ON(tracefs_ops.ns_mkdir || tracefs_ops.ns_rmdir))
+		return NULL;
+
+	dentry = __create_dir(name, parent, &tracefs_dir_inode_ns_operations);
+	if (!dentry)
+		return NULL;
+
+	tracefs_ops.ns_mkdir = mkdir;
+	tracefs_ops.ns_rmdir = rmdir;
+
+	return dentry;
+}
+
 static void remove_one(struct dentry *victim)
 {
 	simple_release_fs(&tracefs_mount, &tracefs_mount_count);
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 99912445974c..04870dee6c87 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -33,6 +33,11 @@ struct dentry *tracefs_create_instance_dir(const char *name, struct dentry *parent,
 					int (*mkdir)(const char *name),
 					int (*rmdir)(const char *name));

+struct dentry *tracefs_create_namespace_dir(const char *name,
+					struct dentry *parent,
+					int (*mkdir)(const char *name),
+					int (*rmdir)(const char *name));
+
 bool tracefs_initialized(void);

 #endif /* CONFIG_TRACING */

From patchwork Thu Jul 28 23:52:37 2022
X-Patchwork-Submitter: Beau Belgrave
X-Patchwork-Id: 12931830
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 3/7] tracing: Add tracing namespace API for systems to register with
Date: Thu, 28 Jul 2022 16:52:37 -0700
Message-Id: <20220728235241.2249-4-beaub@linux.microsoft.com>

User facing tracing systems, such as user_events and LTTng, sometimes
require multiple events with the same name, but from different containers.
This can cause event name conflicts and can leak details of events that
are not owned by the container.

To create a tracing namespace, run mkdir under the new tracefs directory
named "namespaces" (typically /sys/kernel/tracing/namespaces). This
directory largely works the same as "instances": the new directory is
populated with files automatically by the tracing system. The tracing
systems put their files under the "root" directory, which is meant to be
the directory that is bind mounted out to containers. The "options" file
allows operators to configure the namespaces via the registered systems.
The tracing namespace API allows those user facing systems to register
with it. When an operator creates a namespace directory under
/sys/kernel/tracing/namespaces, the registered systems have their create
operation run for that namespace. The systems can then create files in
the new directory that user programs use for tracing. These files then
isolate events between the namespaces the operator creates.

Typically the tracing namespace name is appended onto the event's system
name. For example, if a namespace directory named "mygroup" is created,
the system name would have ".mygroup" appended. Since the system names
differ for each namespace, per-namespace recording and playback can be
done by specifying the per-namespace system name along with the event
name. However, this decision is left to the tracing system registered
for each namespace.

The operator can then bind mount each namespace directory into
containers. This provides isolation between events and containers, if
required. It is also possible for several containers to share an
isolation domain via bind mounts instead of having one isolation domain
per container.

With these files isolated, different permissions can be applied to them
than to normal tracefs files. This helps scenarios where non-admin
processes would like to trace, but currently cannot.
Link: https://lore.kernel.org/all/20220312010140.1880-1-beaub@linux.microsoft.com/
Signed-off-by: Beau Belgrave
---
 kernel/trace/Kconfig           |  11 +
 kernel/trace/Makefile          |   1 +
 kernel/trace/trace.c           |  39 +++
 kernel/trace/trace_namespace.c | 568 +++++++++++++++++++++++++++++++++
 kernel/trace/trace_namespace.h |  57 ++++
 5 files changed, 676 insertions(+)
 create mode 100644 kernel/trace/trace_namespace.c
 create mode 100644 kernel/trace/trace_namespace.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index c9302f46a317..b8b49a16ca02 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -780,6 +780,17 @@ config USER_EVENTS
 	  If in doubt, say N.

+config TRACE_NAMESPACE
+	bool "Tracing namespaces"
+	select TRACING
+	help
+	  Tracing namespaces are isolated directories within tracefs
+	  that can be used to isolate tracing events from other events
+	  and processes. Typically this is most useful for user-defined
+	  trace events.
+
+	  If in doubt, say N.
+
 config HIST_TRIGGERS
 	bool "Histogram triggers"
 	depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 0d261774d6f3..b88241164eb3 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o
 obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o
 obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o
 obj-$(CONFIG_USER_EVENTS) += trace_events_user.o
+obj-$(CONFIG_TRACE_NAMESPACE) += trace_namespace.o
 obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7eb5bce62500..2b0016969c98 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -53,6 +53,10 @@
 #include "trace.h"
 #include "trace_output.h"

+#ifdef CONFIG_TRACE_NAMESPACE
+#include "trace_namespace.h"
+#endif
+
 /*
  * On boot up, the ring buffer is set to the minimum size, so that
  * we do not waste memory on systems that are not using tracing.
@@ -9071,6 +9075,10 @@ static const struct file_operations buffer_percent_fops = {

 static struct dentry *trace_instance_dir;

+#ifdef CONFIG_TRACE_NAMESPACE
+static struct dentry *trace_namespace_dir;
+#endif
+
 static void
 init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer);
@@ -9313,6 +9321,18 @@ static int instance_mkdir(const char *name)
 	return ret;
 }

+#ifdef CONFIG_TRACE_NAMESPACE
+static int namespace_mkdir(const char *name)
+{
+	return trace_namespace_add(name);
+}
+
+static int namespace_rmdir(const char *name)
+{
+	return trace_namespace_remove(name);
+}
+#endif
+
 /**
  * trace_array_get_by_name - Create/Lookup a trace array, given its name.
  * @name: The name of the trace array to be looked up/created.
@@ -9464,6 +9484,21 @@ static __init void create_trace_instances(struct dentry *d_tracer)
 	mutex_unlock(&event_mutex);
 }

+#ifdef CONFIG_TRACE_NAMESPACE
+static __init void create_trace_namespaces(struct dentry *d_tracer)
+{
+	trace_namespace_dir = tracefs_create_namespace_dir("namespaces",
+							   d_tracer,
+							   namespace_mkdir,
+							   namespace_rmdir);
+
+	if (MEM_FAIL(!trace_namespace_dir, "Failed to create namespaces directory\n"))
+		return;
+
+	trace_namespace_init(trace_namespace_dir);
+}
+#endif
+
 static void
 init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 {
@@ -9752,6 +9787,10 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)

 	create_trace_instances(NULL);

+#ifdef CONFIG_TRACE_NAMESPACE
+	create_trace_namespaces(NULL);
+#endif
+
 	update_tracer_options(&global_trace);
 }
diff --git a/kernel/trace/trace_namespace.c b/kernel/trace/trace_namespace.c
new file mode 100644
index 000000000000..ddec53a41de6
--- /dev/null
+++ b/kernel/trace/trace_namespace.c
@@ -0,0 +1,568 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, Microsoft Corporation.
+ *
+ * Authors:
+ *   Beau Belgrave
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include "trace.h"
+#include "trace_namespace.h"
+
+static struct dentry *root_namespace_dir;
+#define TRACE_ROOT_DIR_NAME "root"
+#define TRACE_OPTIONS_NAME "options"
+
+static LIST_HEAD(namespace_systems);
+static LIST_HEAD(namespace_groups);
+static DEFINE_IDR(namespace_idr);
+
+/*
+ * Stores a registered system's operations.
+ */
+struct namespace_system {
+	struct list_head link;
+	struct trace_namespace_operations *ops;
+};
+
+/*
+ * Stores namespace specific data about the group. The group can either
+ * be looked up by name or the id of the trace_namespace property.
+ */
+struct namespace_group {
+	struct list_head link;
+	struct trace_namespace ns;
+	refcount_t refcnt;
+	struct dentry *trace_dir;
+	struct dentry *trace_options;
+};
+
+/* Current parsing group to allow using trace_parse_run_command */
+static struct namespace_group *parsing_group;
+
+#define TRACE_NS_FROM_GROUP(group) (&(group)->ns)
+
+/*
+ * Runs the parse operation for each registered system for the group.
+ */
+static int namespace_systems_parse(struct namespace_group *group,
+				   const char *command)
+{
+	struct list_head *head = &namespace_systems;
+	struct namespace_system *system;
+	struct trace_namespace *ns;
+	int ret = -ENODEV;
+
+	ns = TRACE_NS_FROM_GROUP(group);
+
+	list_for_each_entry(system, head, link) {
+		ret = system->ops->parse(ns, command);
+
+		if (!ret || ret != -ECANCELED)
+			break;
+	}
+
+	if (ret == -ECANCELED)
+		ret = -EINVAL;
+
+	return ret;
+}
+
+/*
+ * Runs the is_busy operation for each registered system for the group.
+ */
+static bool namespace_systems_busy(struct namespace_group *group)
+{
+	struct list_head *head = &namespace_systems;
+	struct namespace_system *system;
+	struct trace_namespace *ns;
+
+	ns = TRACE_NS_FROM_GROUP(group);
+
+	list_for_each_entry(system, head, link)
+		if (system->ops->is_busy(ns))
+			return true;
+
+	return false;
+}
+
+/*
+ * Runs the remove operation for each registered system for the group.
+ *
+ * NOTE: If a system has a failure it does not stop the other systems from
+ * having their remove operation run for the group.
+ */
+static int namespace_systems_remove(struct namespace_group *group, int max)
+{
+	struct list_head *head = &namespace_systems;
+	struct namespace_system *system;
+	struct trace_namespace *ns;
+	int error, ret = 0, i = 0;
+
+	ns = TRACE_NS_FROM_GROUP(group);
+
+	list_for_each_entry(system, head, link) {
+		error = system->ops->remove(ns);
+		i++;
+
+		/* Save last error (if not no entity), but keep removing */
+		if (error && error != -ENOENT)
+			ret = error;
+
+		if (max != -1 && i >= max)
+			break;
+	}
+
+	return ret;
+}
+
+/*
+ * Runs the create operation for each registered system for the group.
+ *
+ * NOTE: If a system has a failure, then the previously successful systems
+ * will have their remove operation run for the group.
+ */
+static int namespace_systems_create(struct namespace_group *group)
+{
+	struct list_head *head = &namespace_systems;
+	struct namespace_system *system;
+	struct trace_namespace *ns;
+	int ret = 0, count = 0;
+
+	ns = TRACE_NS_FROM_GROUP(group);
+
+	list_for_each_entry(system, head, link) {
+		ret = system->ops->create(ns);
+
+		if (ret)
+			break;
+
+		count++;
+	}
+
+	/* If we had a failure, remove systems that were created */
+	if (ret && count > 0)
+		namespace_systems_remove(group, count);
+
+	return ret;
+}
+
+/*
+ * Release a previously acquired reference to a namespace group.
+ */
+static __always_inline
+void namespace_group_release(struct namespace_group *group)
+{
+	refcount_dec(&group->refcnt);
+}
+
+/*
+ * Looks up a namespace group by ID and increments the ref count.
+ */
+static struct namespace_group *namespace_group_ref(int id)
+{
+	struct namespace_group *group;
+
+	mutex_lock(&event_mutex);
+
+	group = idr_find(&namespace_idr, id);
+
+	if (group)
+		refcount_inc(&group->refcnt);
+
+	mutex_unlock(&event_mutex);
+
+	return group;
+}
+
+/*
+ * Looks up a namespace group by name, without increasing the ref count.
+ */
+static struct namespace_group *namespace_group_find(const char *name)
+{
+	struct list_head *head = &namespace_groups;
+	struct namespace_group *group;
+	struct trace_namespace *ns;
+
+	lockdep_assert_held(&event_mutex);
+
+	list_for_each_entry(group, head, link) {
+		ns = TRACE_NS_FROM_GROUP(group);
+
+		if (!strcmp(ns->name, name))
+			return group;
+	}
+
+	return NULL;
+}
+
+/*
+ * Frees group resources and removes the directory of a namespace.
+ */
+static void namespace_group_destroy(struct namespace_group *group)
+{
+	struct trace_namespace *ns = TRACE_NS_FROM_GROUP(group);
+
+	lockdep_assert_held(&event_mutex);
+
+	if (ns->id > 0)
+		idr_remove(&namespace_idr, ns->id);
+
+	if (ns->dir)
+		tracefs_remove(ns->dir);
+
+	if (group->trace_options)
+		tracefs_remove(group->trace_options);
+
+	if (group->trace_dir)
+		tracefs_remove(group->trace_dir);
+
+	kfree(ns->name);
+	kfree(group);
+}
+
+static void *group_options_seq_start(struct seq_file *m, loff_t *pos)
+{
+	mutex_lock(&event_mutex);
+
+	return seq_list_start(&namespace_systems, *pos);
+}
+
+static void *group_options_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &namespace_systems, pos);
+}
+
+static void group_options_seq_stop(struct seq_file *m, void *v)
+{
+	mutex_unlock(&event_mutex);
+}
+
+static int group_options_seq_show(struct seq_file *m, void *v)
+{
+	struct namespace_system *system = v;
+	struct namespace_group *group = m->private;
+
+	if (system && system->ops && group)
+		return system->ops->show(TRACE_NS_FROM_GROUP(group), m);
+
+	return 0;
+}
+
+static const struct seq_operations group_options_seq_op = {
+	.start = group_options_seq_start,
+	.next = group_options_seq_next,
+	.stop = group_options_seq_stop,
+	.show = group_options_seq_show
+};
+
+/*
+ * Gets the group associated with the current seq_file.
+ */
+static struct namespace_group *seq_file_namespace_group(struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+
+	if (!seq)
+		return NULL;
+
+	return seq->private;
+}
+
+static int group_options_open(struct inode *node, struct file *file)
+{
+	struct namespace_group *group;
+	int ret;
+
+	group = namespace_group_ref((int)(uintptr_t)node->i_private);
+
+	if (!group)
+		return -ENOENT;
+
+	ret = seq_open(file, &group_options_seq_op);
+
+	if (!ret) {
+		/* Chain group into seq_file private data */
+		struct seq_file *seq = file->private_data;
+
+		seq->private = group;
+	}
+
+	return ret;
+}
+
+static int group_options_parse(const char *command)
+{
+	return namespace_systems_parse(parsing_group, command);
+}
+
+static ssize_t group_options_write(struct file *file, const char __user *buffer,
+				   size_t count, loff_t *ppos)
+{
+	struct namespace_group *group = seq_file_namespace_group(file);
+	int ret;
+
+	if (!group)
+		return -EINVAL;
+
+	mutex_lock(&event_mutex);
+
+	/* Set group to use for commands */
+	parsing_group = group;
+
+	ret = trace_parse_run_command(file, buffer, count, ppos,
+				      group_options_parse);
+
+	parsing_group = NULL;
+
+	mutex_unlock(&event_mutex);
+
+	return ret;
+}
+
+static int group_options_release(struct inode *node, struct file *file)
+{
+	struct namespace_group *group = seq_file_namespace_group(file);
+
+	if (group)
+		namespace_group_release(group);
+
+	return seq_release(node, file);
+}
+
+static const struct file_operations group_options_fops = {
+	.open = group_options_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = group_options_release,
+	.write = group_options_write,
+};
+
+/*
+ * Creates a group that tracks the name and directory of a namespace.
+ */
+static struct namespace_group *namespace_group_create(const char *name)
+{
+	struct namespace_group *group;
+	struct trace_namespace *ns;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+
+	if (!group)
+		goto error;
+
+	refcount_set(&group->refcnt, 1);
+
+	ns = TRACE_NS_FROM_GROUP(group);
+	ns->name = kstrdup(name, GFP_KERNEL);
+
+	if (!ns->name)
+		goto error;
+
+	/*
+	 * 0 is reserved for non-namespace lookups for systems to use.
+	 * If this were not the case, systems would have to pivot code
+	 * between namespace cases and non-namespace cases.
+	 *
+	 * Cyclic is used here to reduce the chances of the same id being
+	 * used again quickly. This allows for less chance of an id lookup
+	 * getting the wrong namespace during file open cases.
+	 */
+	ns->id = idr_alloc_cyclic(&namespace_idr, group, 1, 0, GFP_KERNEL);
+
+	if (ns->id < 0)
+		goto error;
+
+	group->trace_dir = tracefs_create_dir(ns->name, root_namespace_dir);
+
+	if (!group->trace_dir)
+		goto error;
+
+	group->trace_options = tracefs_create_file(TRACE_OPTIONS_NAME,
+						   TRACE_MODE_WRITE,
+						   group->trace_dir,
+						   (void *)(uintptr_t)ns->id,
+						   &group_options_fops);
+
+	if (!group->trace_options)
+		goto error;
+
+	ns->dir = tracefs_create_dir(TRACE_ROOT_DIR_NAME, group->trace_dir);
+
+	if (!ns->dir)
+		goto error;
+
+	return group;
+error:
+	if (group)
+		namespace_group_destroy(group);
+
+	return NULL;
+}
+
+/**
+ * trace_namespace_register - register a system for tracing namespaces.
+ * @ops: operations to perform for each namespace
+ *
+ * Registers a system that runs operations for each namespace on the system.
+ * This will fail if not all operations are specified.
+ */
+int trace_namespace_register(struct trace_namespace_operations *ops)
+{
+	struct namespace_system *system;
+
+	if (!ops->create || !ops->remove || !ops->is_busy ||
+	    !ops->parse || !ops->show)
+		return -EINVAL;
+
+	system = kmalloc(sizeof(*system), GFP_KERNEL);
+
+	if (!system)
+		return -ENOMEM;
+
+	system->ops = ops;
+
+	mutex_lock(&event_mutex);
+
+	list_add(&system->link, &namespace_systems);
+
+	mutex_unlock(&event_mutex);
+
+	return 0;
+}
+
+/**
+ * trace_namespace_init - configures namespaces to be used on the system.
+ * @dir: directory to use for namespaces
+ *
+ * Configures the directory to be used for namespaces.
+ *
+ * NOTE: Can only be called once.
+ */
+int trace_namespace_init(struct dentry *dir)
+{
+	int ret = 0;
+
+	mutex_lock(&event_mutex);
+
+	if (root_namespace_dir) {
+		pr_warn("trace namespace init called more than once\n");
+		ret = -EEXIST;
+		goto out;
+	}
+
+	root_namespace_dir = dir;
+out:
+	mutex_unlock(&event_mutex);
+
+	return ret;
+}
+
+/**
+ * trace_namespace_add - adds a trace namespace to the system.
+ * @name: name of the namespace
+ *
+ * Adds a new trace namespace to the system. This can fail if the
+ * namespace already exists or if there are internal errors within
+ * sub-systems registered for namespaces.
+ */
+int trace_namespace_add(const char *name)
+{
+	struct namespace_group *group;
+	int ret = 0;
+
+	mutex_lock(&event_mutex);
+
+	if (!root_namespace_dir) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	/* Ensure we don't already have this group */
+	group = namespace_group_find(name);
+
+	if (group) {
+		ret = -EEXIST;
+		goto out;
+	}
+
+	/* Create the group */
+	group = namespace_group_create(name);
+
+	if (!group) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Notify the systems of a new group */
+	ret = namespace_systems_create(group);
+
+	if (!ret)
+		list_add(&group->link, &namespace_groups);
+out:
+	/* Ensure we cleanup on failure */
+	if (ret && group)
+		namespace_group_destroy(group);
+
+	mutex_unlock(&event_mutex);
+
+	return ret;
+}
+
+/**
+ * trace_namespace_remove - remove a trace namespace from the system.
+ * @name: name of the namespace
+ *
+ * Removes an existing trace namespace from the system. This can fail if
+ * the namespace doesn't exist, the namespace is busy, or there are
+ * internal errors within sub-systems registered for namespaces.
+ */ +int trace_namespace_remove(const char *name) +{ + struct namespace_group *group; + int ret = 0; + + mutex_lock(&event_mutex); + + if (!root_namespace_dir) { + ret = -ENODEV; + goto out; + } + + group = namespace_group_find(name); + + if (!group) { + ret = -ENOENT; + goto out; + } + + if (refcount_read(&group->refcnt) != 1) { + ret = -EBUSY; + goto out; + } + + if (namespace_systems_busy(group)) { + ret = -EBUSY; + goto out; + } + + ret = namespace_systems_remove(group, -1); + + if (!ret) { + list_del(&group->link); + + namespace_group_destroy(group); + } + +out: + mutex_unlock(&event_mutex); + + return ret; +} diff --git a/kernel/trace/trace_namespace.h b/kernel/trace/trace_namespace.h new file mode 100644 index 000000000000..644e2d6c4802 --- /dev/null +++ b/kernel/trace/trace_namespace.h @@ -0,0 +1,57 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_KERNEL_TRACE_NAMESPACE_H +#define _LINUX_KERNEL_TRACE_NAMESPACE_H + +/** + * struct trace_namespace - Trace namespace information + * + * @name: Unique name of the namespace, can be used for event system names, + * etc. + * @dir: Directory of the namespace, can be used for creating system files. + * @id: Id of the namespace, can be used for looking up associated data by + * namespace. NOTE: 0 is reserved for non-namespace lookups for systems. + */ +struct trace_namespace { + const char *name; + struct dentry *dir; + int id; +}; + +/** + * struct trace_namespace_operations - Methods to run for each trace namespace + * + * These methods must be set for each system using trace namespaces. + * + * @create: Run when a trace namespace is being created. Systems create files + * for the namespace with appropriate options. Return 0 if successful. + * @is_busy: Check whether the system is busy within the namespace. Return + * true if it is busy, otherwise false. + * @remove: Removes the namespace from the system. Return 0 if successful, + * return -ENOENT if the namespace is not within the system. 
All other return + * values are treated as errors. + * @parse: Parses a command to configure a namespace. Return 0 if successful, + * return -ECANCELED if the command is not for your system. All other return + * values are treated as errors. + * @show: Shows the configured options for the namespace. This is run when a + * user reads the options of the namespace. + * + * NOTE: These methods are called while holding event_mutex. + */ +struct trace_namespace_operations { + int (*create)(struct trace_namespace *ns); + int (*remove)(struct trace_namespace *ns); + int (*parse)(struct trace_namespace *ns, const char *command); + int (*show)(struct trace_namespace *ns, struct seq_file *m); + bool (*is_busy)(struct trace_namespace *ns); +}; + +int trace_namespace_register(struct trace_namespace_operations *ops); + +int trace_namespace_init(struct dentry *dir); + +int trace_namespace_add(const char *name); + +int trace_namespace_remove(const char *name); + +#endif /* _LINUX_KERNEL_TRACE_NAMESPACE_H */
From patchwork Thu Jul 28 23:52:38 2022 X-Patchwork-Submitter: Beau Belgrave X-Patchwork-Id: 12931829 From: Beau Belgrave To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 4/7] tracing/user_events: Move pages/locks into groups to prepare for namespaces Date: Thu, 28 Jul 2022 16:52:38 -0700 Message-Id: <20220728235241.2249-5-beaub@linux.microsoft.com> In-Reply-To: <20220728235241.2249-1-beaub@linux.microsoft.com> References: <20220728235241.2249-1-beaub@linux.microsoft.com> In order to enable namespaces or any sort of isolation within user_events the register lock and pages need to be broken up into groups. Each event and file now has a group pointer which stores the actual pages to map, lookup data and synchronization objects.
Signed-off-by: Beau Belgrave --- kernel/trace/trace_events_user.c | 381 ++++++++++++++++++++++++------- 1 file changed, 304 insertions(+), 77 deletions(-) diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index fd8ea555437a..44f9efd58af5 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -69,11 +69,23 @@ #define EVENT_STATUS_PERF BIT(1) #define EVENT_STATUS_OTHER BIT(7) -static char *register_page_data; +struct user_event_group { + struct page *pages; + char *register_page_data; + char *system_name; + struct dentry *status_file; + struct dentry *data_file; + struct hlist_node node; + struct mutex reg_mutex; + DECLARE_HASHTABLE(register_table, 8); + DECLARE_BITMAP(page_bitmap, MAX_EVENTS); + refcount_t refcnt; + int id; +}; -static DEFINE_MUTEX(reg_mutex); -static DEFINE_HASHTABLE(register_table, 8); -static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); +static DEFINE_HASHTABLE(group_table, 8); +static DEFINE_MUTEX(group_mutex); +static struct user_event_group *root_group; /* * Stores per-event properties, as users register events @@ -83,6 +95,7 @@ static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); * refcnt reaches one. 
*/ struct user_event { + struct user_event_group *group; struct tracepoint tracepoint; struct trace_event_call call; struct trace_event_class class; @@ -109,6 +122,11 @@ struct user_event_refs { struct user_event *events[]; }; +struct user_event_file_info { + struct user_event_group *group; + struct user_event_refs *refs; +}; + #define VALIDATOR_ENSURE_NULL (1 << 0) #define VALIDATOR_REL (1 << 1) @@ -121,7 +139,8 @@ struct user_event_validator { typedef void (*user_event_func_t) (struct user_event *user, struct iov_iter *i, void *tpdata, bool *faulted); -static int user_event_parse(char *name, char *args, char *flags, +static int user_event_parse(struct user_event_group *group, char *name, + char *args, char *flags, struct user_event **newuser); static u32 user_event_key(char *name) @@ -129,12 +148,132 @@ static u32 user_event_key(char *name) return jhash(name, strlen(name), 0); } +static void set_page_reservations(char *pages, bool set) +{ + int page; + + for (page = 0; page < MAX_PAGES; ++page) { + void *addr = pages + (PAGE_SIZE * page); + + if (set) + SetPageReserved(virt_to_page(addr)); + else + ClearPageReserved(virt_to_page(addr)); + } +} + +static void user_event_group_destroy(struct user_event_group *group) +{ + if (group->status_file) + tracefs_remove(group->status_file); + + if (group->data_file) + tracefs_remove(group->data_file); + + if (group->register_page_data) + set_page_reservations(group->register_page_data, false); + + if (group->pages) + __free_pages(group->pages, MAX_PAGE_ORDER); + + kfree(group->system_name); + kfree(group); +} + +static char *user_event_group_system_name(const char *name) +{ + char *system_name; + int len = strlen(name) + sizeof(USER_EVENTS_SYSTEM) + 1; + + system_name = kmalloc(len, GFP_KERNEL); + + if (!system_name) + return NULL; + + snprintf(system_name, len, "%s.%s", USER_EVENTS_SYSTEM, name); + + return system_name; +} + +static __always_inline +void user_event_group_release(struct user_event_group *group) +{ + 
refcount_dec(&group->refcnt); +} + +static struct user_event_group *user_event_group_find(int id) +{ + struct user_event_group *group; + + mutex_lock(&group_mutex); + + hash_for_each_possible(group_table, group, node, id) + if (group->id == id) { + refcount_inc(&group->refcnt); + mutex_unlock(&group_mutex); + return group; + } + + mutex_unlock(&group_mutex); + + return NULL; +} + +static struct user_event_group *user_event_group_create(const char *name, + int id) +{ + struct user_event_group *group; + + group = kzalloc(sizeof(*group), GFP_KERNEL); + + if (!group) + return NULL; + + if (name) { + group->system_name = user_event_group_system_name(name); + + if (!group->system_name) + goto error; + } + + group->pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PAGE_ORDER); + + if (!group->pages) + goto error; + + group->register_page_data = page_address(group->pages); + + set_page_reservations(group->register_page_data, true); + + /* Zero all bits beside 0 (which is reserved for failures) */ + bitmap_zero(group->page_bitmap, MAX_EVENTS); + set_bit(0, group->page_bitmap); + + mutex_init(&group->reg_mutex); + hash_init(group->register_table); + + /* Mark and add to lookup */ + group->id = id; + refcount_set(&group->refcnt, 2); + + mutex_lock(&group_mutex); + hash_add(group_table, &group->node, group->id); + mutex_unlock(&group_mutex); + + return group; +error: + if (group) + user_event_group_destroy(group); + + return NULL; +}; + static __always_inline void user_event_register_set(struct user_event *user) { int i = user->index; - register_page_data[MAP_STATUS_BYTE(i)] |= MAP_STATUS_MASK(i); + user->group->register_page_data[MAP_STATUS_BYTE(i)] |= MAP_STATUS_MASK(i); } static __always_inline @@ -142,7 +281,7 @@ void user_event_register_clear(struct user_event *user) { int i = user->index; - register_page_data[MAP_STATUS_BYTE(i)] &= ~MAP_STATUS_MASK(i); + user->group->register_page_data[MAP_STATUS_BYTE(i)] &= ~MAP_STATUS_MASK(i); } static __always_inline __must_check @@ 
-186,7 +325,8 @@ static struct list_head *user_event_get_fields(struct trace_event_call *call) * * Upon success user_event has its ref count increased by 1. */ -static int user_event_parse_cmd(char *raw_command, struct user_event **newuser) +static int user_event_parse_cmd(struct user_event_group *group, + char *raw_command, struct user_event **newuser) { char *name = raw_command; char *args = strpbrk(name, " "); @@ -200,7 +340,7 @@ static int user_event_parse_cmd(char *raw_command, struct user_event **newuser) if (flags) *flags++ = '\0'; - return user_event_parse(name, args, flags, newuser); + return user_event_parse(group, name, args, flags, newuser); } static int user_field_array_size(const char *type) @@ -688,7 +828,7 @@ static int destroy_user_event(struct user_event *user) dyn_event_remove(&user->devent); user_event_register_clear(user); - clear_bit(user->index, page_bitmap); + clear_bit(user->index, user->group->page_bitmap); hash_del(&user->node); user_event_destroy_validators(user); @@ -699,14 +839,15 @@ static int destroy_user_event(struct user_event *user) return ret; } -static struct user_event *find_user_event(char *name, u32 *outkey) +static struct user_event *find_user_event(struct user_event_group *group, + char *name, u32 *outkey) { struct user_event *user; u32 key = user_event_key(name); *outkey = key; - hash_for_each_possible(register_table, user, node, key) + hash_for_each_possible(group->register_table, user, node, key) if (!strcmp(EVENT_NAME(user), name)) { refcount_inc(&user->refcnt); return user; @@ -953,14 +1094,14 @@ static int user_event_create(const char *raw_command) if (!name) return -ENOMEM; - mutex_lock(®_mutex); + mutex_lock(&root_group->reg_mutex); - ret = user_event_parse_cmd(name, &user); + ret = user_event_parse_cmd(root_group, name, &user); if (!ret) refcount_dec(&user->refcnt); - mutex_unlock(®_mutex); + mutex_unlock(&root_group->reg_mutex); if (ret) kfree(name); @@ -1114,7 +1255,8 @@ static int 
user_event_trace_register(struct user_event *user) * The name buffer lifetime is owned by this method for success cases only. * Upon success the returned user_event has its ref count increased by 1. */ -static int user_event_parse(char *name, char *args, char *flags, +static int user_event_parse(struct user_event_group *group, char *name, + char *args, char *flags, struct user_event **newuser) { int ret; @@ -1124,7 +1266,7 @@ static int user_event_parse(char *name, char *args, char *flags, /* Prevent dyn_event from racing */ mutex_lock(&event_mutex); - user = find_user_event(name, &key); + user = find_user_event(group, name, &key); mutex_unlock(&event_mutex); if (user) { @@ -1137,7 +1279,7 @@ static int user_event_parse(char *name, char *args, char *flags, return 0; } - index = find_first_zero_bit(page_bitmap, MAX_EVENTS); + index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS); if (index == MAX_EVENTS) return -EMFILE; @@ -1151,6 +1293,7 @@ static int user_event_parse(char *name, char *args, char *flags, INIT_LIST_HEAD(&user->fields); INIT_LIST_HEAD(&user->validators); + user->group = group; user->tracepoint.name = name; ret = user_event_parse_fields(user, args); @@ -1170,7 +1313,11 @@ static int user_event_parse(char *name, char *args, char *flags, user->call.tp = &user->tracepoint; user->call.event.funcs = &user_event_funcs; - user->class.system = USER_EVENTS_SYSTEM; + if (group->system_name) + user->class.system = group->system_name; + else + user->class.system = USER_EVENTS_SYSTEM; + user->class.fields_array = user_event_fields_array; user->class.get_fields = user_event_get_fields; user->class.reg = user_event_reg; @@ -1193,8 +1340,8 @@ static int user_event_parse(char *name, char *args, char *flags, dyn_event_init(&user->devent, &user_event_dops); dyn_event_add(&user->devent, &user->call); - set_bit(user->index, page_bitmap); - hash_add(register_table, &user->node, key); + set_bit(user->index, group->page_bitmap); + hash_add(group->register_table, 
&user->node, key); mutex_unlock(&event_mutex); @@ -1212,10 +1359,10 @@ static int user_event_parse(char *name, char *args, char *flags, /* * Deletes a previously created event if it is no longer being used. */ -static int delete_user_event(char *name) +static int delete_user_event(struct user_event_group *group, char *name) { u32 key; - struct user_event *user = find_user_event(name, &key); + struct user_event *user = find_user_event(group, name, &key); if (!user) return -ENOENT; @@ -1233,6 +1380,7 @@ static int delete_user_event(char *name) */ static ssize_t user_events_write_core(struct file *file, struct iov_iter *i) { + struct user_event_file_info *info = file->private_data; struct user_event_refs *refs; struct user_event *user = NULL; struct tracepoint *tp; @@ -1244,7 +1392,7 @@ static ssize_t user_events_write_core(struct file *file, struct iov_iter *i) rcu_read_lock_sched(); - refs = rcu_dereference_sched(file->private_data); + refs = rcu_dereference_sched(info->refs); /* * The refs->events array is protected by RCU, and new items may be @@ -1302,6 +1450,30 @@ static ssize_t user_events_write_core(struct file *file, struct iov_iter *i) return ret; } +static int user_events_open(struct inode *node, struct file *file) +{ + struct user_event_group *group; + struct user_event_file_info *info; + + group = user_event_group_find((int)(uintptr_t)node->i_private); + + if (!group) + return -ENOENT; + + info = kzalloc(sizeof(*info), GFP_KERNEL); + + if (!info) { + user_event_group_release(group); + return -ENOMEM; + } + + info->group = group; + + file->private_data = info; + + return 0; +} + static ssize_t user_events_write(struct file *file, const char __user *ubuf, size_t count, loff_t *ppos) { @@ -1323,13 +1495,15 @@ static ssize_t user_events_write_iter(struct kiocb *kp, struct iov_iter *i) return user_events_write_core(kp->ki_filp, i); } -static int user_events_ref_add(struct file *file, struct user_event *user) +static int user_events_ref_add(struct 
user_event_file_info *info, + struct user_event *user) { + struct user_event_group *group = info->group; struct user_event_refs *refs, *new_refs; int i, size, count = 0; - refs = rcu_dereference_protected(file->private_data, - lockdep_is_held(®_mutex)); + refs = rcu_dereference_protected(info->refs, + lockdep_is_held(&group->reg_mutex)); if (refs) { count = refs->count; @@ -1355,7 +1529,7 @@ static int user_events_ref_add(struct file *file, struct user_event *user) refcount_inc(&user->refcnt); - rcu_assign_pointer(file->private_data, new_refs); + rcu_assign_pointer(info->refs, new_refs); if (refs) kfree_rcu(refs, rcu); @@ -1392,7 +1566,8 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) /* * Registers a user_event on behalf of a user process. */ -static long user_events_ioctl_reg(struct file *file, unsigned long uarg) +static long user_events_ioctl_reg(struct user_event_file_info *info, + unsigned long uarg) { struct user_reg __user *ureg = (struct user_reg __user *)uarg; struct user_reg reg; @@ -1413,14 +1588,14 @@ static long user_events_ioctl_reg(struct file *file, unsigned long uarg) return ret; } - ret = user_event_parse_cmd(name, &user); + ret = user_event_parse_cmd(info->group, name, &user); if (ret) { kfree(name); return ret; } - ret = user_events_ref_add(file, user); + ret = user_events_ref_add(info, user); /* No longer need parse ref, ref_add either worked or not */ refcount_dec(&user->refcnt); @@ -1438,7 +1613,8 @@ static long user_events_ioctl_reg(struct file *file, unsigned long uarg) /* * Deletes a user_event on behalf of a user process. 
*/ -static long user_events_ioctl_del(struct file *file, unsigned long uarg) +static long user_events_ioctl_del(struct user_event_file_info *info, + unsigned long uarg) { void __user *ubuf = (void __user *)uarg; char *name; @@ -1451,7 +1627,7 @@ static long user_events_ioctl_del(struct file *file, unsigned long uarg) /* event_mutex prevents dyn_event from racing */ mutex_lock(&event_mutex); - ret = delete_user_event(name); + ret = delete_user_event(info->group, name); mutex_unlock(&event_mutex); kfree(name); @@ -1465,19 +1641,21 @@ static long user_events_ioctl_del(struct file *file, unsigned long uarg) static long user_events_ioctl(struct file *file, unsigned int cmd, unsigned long uarg) { + struct user_event_file_info *info = file->private_data; + struct user_event_group *group = info->group; long ret = -ENOTTY; switch (cmd) { case DIAG_IOCSREG: - mutex_lock(®_mutex); - ret = user_events_ioctl_reg(file, uarg); - mutex_unlock(®_mutex); + mutex_lock(&group->reg_mutex); + ret = user_events_ioctl_reg(info, uarg); + mutex_unlock(&group->reg_mutex); break; case DIAG_IOCSDEL: - mutex_lock(®_mutex); - ret = user_events_ioctl_del(file, uarg); - mutex_unlock(®_mutex); + mutex_lock(&group->reg_mutex); + ret = user_events_ioctl_del(info, uarg); + mutex_unlock(&group->reg_mutex); break; } @@ -1489,17 +1667,24 @@ static long user_events_ioctl(struct file *file, unsigned int cmd, */ static int user_events_release(struct inode *node, struct file *file) { + struct user_event_file_info *info = file->private_data; + struct user_event_group *group; struct user_event_refs *refs; struct user_event *user; int i; + if (!info) + return -EINVAL; + + group = info->group; + /* * Ensure refs cannot change under any situation by taking the * register mutex during the final freeing of the references. 
*/ - mutex_lock(®_mutex); + mutex_lock(&group->reg_mutex); - refs = file->private_data; + refs = info->refs; if (!refs) goto out; @@ -1518,32 +1703,54 @@ static int user_events_release(struct inode *node, struct file *file) out: file->private_data = NULL; - mutex_unlock(®_mutex); + mutex_unlock(&group->reg_mutex); kfree(refs); + kfree(info); + + /* No longer using group */ + user_event_group_release(group); return 0; } static const struct file_operations user_data_fops = { + .open = user_events_open, .write = user_events_write, .write_iter = user_events_write_iter, .unlocked_ioctl = user_events_ioctl, .release = user_events_release, }; +static struct user_event_group *user_status_group(struct file *file) +{ + struct seq_file *m = file->private_data; + + if (!m) + return NULL; + + return m->private; +} + /* * Maps the shared page into the user process for checking if event is enabled. */ static int user_status_mmap(struct file *file, struct vm_area_struct *vma) { + char *pages; + struct user_event_group *group = user_status_group(file); unsigned long size = vma->vm_end - vma->vm_start; if (size != MAX_BYTES) return -EINVAL; + if (!group) + return -EINVAL; + + pages = group->register_page_data; + return remap_pfn_range(vma, vma->vm_start, - virt_to_phys(register_page_data) >> PAGE_SHIFT, + virt_to_phys(pages) >> PAGE_SHIFT, size, vm_get_page_prot(VM_READ)); } @@ -1567,13 +1774,17 @@ static void user_seq_stop(struct seq_file *m, void *p) static int user_seq_show(struct seq_file *m, void *p) { + struct user_event_group *group = m->private; struct user_event *user; char status; int i, active = 0, busy = 0, flags; - mutex_lock(®_mutex); + if (!group) + return -EINVAL; + + mutex_lock(&group->reg_mutex); - hash_for_each(register_table, i, user, node) { + hash_for_each(group->register_table, i, user, node) { status = user->status; flags = user->flags; @@ -1597,7 +1808,7 @@ static int user_seq_show(struct seq_file *m, void *p) active++; } - mutex_unlock(®_mutex); + 
mutex_unlock(&group->reg_mutex); seq_puts(m, "\n"); seq_printf(m, "Active: %d\n", active); @@ -1616,7 +1827,38 @@ static const struct seq_operations user_seq_ops = { static int user_status_open(struct inode *node, struct file *file) { - return seq_open(file, &user_seq_ops); + struct user_event_group *group; + int ret; + + group = user_event_group_find((int)(uintptr_t)node->i_private); + + if (!group) + return -ENOENT; + + ret = seq_open(file, &user_seq_ops); + + if (!ret) { + /* Chain group to seq_file */ + struct seq_file *m = file->private_data; + + m->private = group; + } else { + user_event_group_release(group); + } + + return ret; +} + +static int user_status_release(struct inode *node, struct file *file) +{ + struct user_event_group *group = user_status_group(file); + + if (group) + user_event_group_release(group); + else + pr_warn("user_events: No group attached to status file\n"); + + return seq_release(node, file); } static const struct file_operations user_status_fops = { @@ -1624,18 +1866,20 @@ static const struct file_operations user_status_fops = { .mmap = user_status_mmap, .read = seq_read, .llseek = seq_lseek, - .release = seq_release, + .release = user_status_release, }; /* * Creates a set of tracefs files to allow user mode interactions. 
*/ -static int create_user_tracefs(void) +static int create_user_tracefs(struct dentry *parent, + struct user_event_group *group) { struct dentry *edata, *emmap; edata = tracefs_create_file("user_events_data", TRACE_MODE_WRITE, - NULL, NULL, &user_data_fops); + parent, (void *)(uintptr_t)group->id, + &user_data_fops); if (!edata) { pr_warn("Could not create tracefs 'user_events_data' entry\n"); @@ -1644,7 +1888,8 @@ static int create_user_tracefs(void) /* mmap with MAP_SHARED requires writable fd */ emmap = tracefs_create_file("user_events_status", TRACE_MODE_WRITE, - NULL, NULL, &user_status_fops); + parent, (void *)(uintptr_t)group->id, + &user_status_fops); if (!emmap) { tracefs_remove(edata); @@ -1652,47 +1897,29 @@ static int create_user_tracefs(void) goto err; } + group->data_file = edata; + group->status_file = emmap; + return 0; err: return -ENODEV; } -static void set_page_reservations(bool set) -{ - int page; - - for (page = 0; page < MAX_PAGES; ++page) { - void *addr = register_page_data + (PAGE_SIZE * page); - - if (set) - SetPageReserved(virt_to_page(addr)); - else - ClearPageReserved(virt_to_page(addr)); - } -} - static int __init trace_events_user_init(void) { - struct page *pages; int ret; - /* Zero all bits beside 0 (which is reserved for failures) */ - bitmap_zero(page_bitmap, MAX_EVENTS); - set_bit(0, page_bitmap); + root_group = user_event_group_create(NULL, 0); - pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PAGE_ORDER); - if (!pages) + if (!root_group) return -ENOMEM; - register_page_data = page_address(pages); - - set_page_reservations(true); - ret = create_user_tracefs(); + ret = create_user_tracefs(NULL, root_group); if (ret) { pr_warn("user_events could not register with tracefs\n"); - set_page_reservations(false); - __free_pages(pages, MAX_PAGE_ORDER); + user_event_group_destroy(root_group); + root_group = NULL; return ret; } From patchwork Thu Jul 28 23:52:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Beau Belgrave X-Patchwork-Id: 12931825 From: Beau Belgrave To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 5/7] tracing/user_events: Register with trace namespace API Date: Thu, 28 Jul 2022 16:52:39 -0700 Message-Id: <20220728235241.2249-6-beaub@linux.microsoft.com> In-Reply-To: <20220728235241.2249-1-beaub@linux.microsoft.com> References:
<20220728235241.2249-1-beaub@linux.microsoft.com> Register user_events with the trace namespace API to allow user programs to interface with isolated events when required. Each namespace will have its own user_events_status and user_events_data files that have the same ABI as before; however, the system name for events created will be different (user_events. vs user_events). Signed-off-by: Beau Belgrave --- kernel/trace/trace_events_user.c | 167 ++++++++++++++++++++++++++++++- 1 file changed, 166 insertions(+), 1 deletion(-) diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 44f9efd58af5..9694eee27956 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -23,6 +23,10 @@ #include "trace.h" #include "trace_dynevent.h" +#ifdef CONFIG_TRACE_NAMESPACE +#include "trace_namespace.h" +#endif + #define USER_EVENTS_PREFIX_LEN (sizeof(USER_EVENTS_PREFIX)-1) #define FIELD_DEPTH_TYPE 0 @@ -180,6 +184,18 @@ static void user_event_group_destroy(struct user_event_group *group) kfree(group); } +static void user_event_group_unlink(struct user_event_group *group) +{ + if (WARN_ON(refcount_read(&group->refcnt) != 1)) + pr_warn("user_event: Group unlink with more than 1 ref\n"); + + mutex_lock(&group_mutex); + hash_del(&group->node); + mutex_unlock(&group_mutex); + + user_event_group_destroy(group); +} + static char *user_event_group_system_name(const char *name) { char *system_name; @@ -262,6 +278,7 @@ static struct user_event_group *user_event_group_create(const char *name, return group; error: + /* Hash table not added, safe to destroy vs unlink */ if (group) user_event_group_destroy(group); @@ -1905,6 +1922,148 @@ static int create_user_tracefs(struct dentry *parent, return -ENODEV; } +#ifdef CONFIG_TRACE_NAMESPACE +static int user_event_ns_create(struct trace_namespace *ns) +{ + struct user_event_group *group; + int
ret; + + group = user_event_group_create(ns->name, ns->id); + + if (!group) + return -ENOMEM; + + ret = create_user_tracefs(ns->dir, group); + + user_event_group_release(group); + + if (ret) { + user_event_group_unlink(group); + return ret; + } + + return 0; +} + +static int user_event_ns_remove(struct trace_namespace *ns) +{ + struct user_event_group *group = user_event_group_find(ns->id); + struct user_event *user; + struct hlist_node *tmp; + int i, ret = 0; + + if (!group) + return -ENOENT; + + /* + * Lock out finding this namespace while we are doing this so that + * user programs trying to open a file owned by this group will block + * until we are done here. The user program upon unblocking will then + * fail to find the group if we removed it. + */ + mutex_lock(&group_mutex); + + /* Ensure we have the only reference */ + if (refcount_read(&group->refcnt) != 2) { + ret = -EBUSY; + goto out; + } + + /* + * At this point no more files can be opened by user space programs + * while we are holding the group_mutex (they'll block on group_mutex). + * To ensure other parts of the kernel aren't registering something we + * also grab the group register mutex as an extra precaution. + * + * The events might be being recorded, which will result in their + * being busy and we'll bail out. + * + * NOTE: event_mutex is held, locking reg_mutex could deadlock so we + * must try to lock it and treat as busy if we cannot. 
+ */ + if (!mutex_trylock(&group->reg_mutex)) { + ret = -EBUSY; + goto out; + } + + hash_for_each_safe(group->register_table, i, tmp, user, node) { + if (!user_event_last_ref(user)) { + ret = -EBUSY; + break; + } + + ret = destroy_user_event(user); + + if (ret) + break; + } + + mutex_unlock(&group->reg_mutex); +out: + mutex_unlock(&group_mutex); + + user_event_group_release(group); + + if (!ret) + user_event_group_unlink(group); + + return ret; +} + +static int user_event_ns_parse(struct trace_namespace *ns, const char *command) +{ + return -ECANCELED; +} + +static int user_event_ns_show(struct trace_namespace *ns, struct seq_file *m) +{ + return 0; +} + +static bool user_event_ns_is_busy(struct trace_namespace *ns) +{ + struct user_event_group *group = user_event_group_find(ns->id); + struct user_event *user; + int i; + bool busy = false; + + if (!group) + return false; + + /* + * Quick check to ensure all events aren't busy: + * The actual remove will do a more exhaustive check including + * finding if any outstanding files are opened, etc. + * + * NOTE: event_mutex is held, locking reg_mutex could deadlock so we + * must try to lock it and treat as busy if we cannot. 
+ */ + if (!mutex_trylock(&group->reg_mutex)) + return true; + + hash_for_each(group->register_table, i, user, node) { + if (!user_event_last_ref(user)) { + busy = true; + break; + } + } + + mutex_unlock(&group->reg_mutex); + + user_event_group_release(group); + + return busy; +} + +static struct trace_namespace_operations user_event_ns_ops = { + .create = user_event_ns_create, + .remove = user_event_ns_remove, + .parse = user_event_ns_parse, + .show = user_event_ns_show, + .is_busy = user_event_ns_is_busy, +}; +#endif + static int __init trace_events_user_init(void) { int ret; @@ -1918,7 +2077,8 @@ static int __init trace_events_user_init(void) if (ret) { pr_warn("user_events could not register with tracefs\n"); - user_event_group_destroy(root_group); + user_event_group_release(root_group); + user_event_group_unlink(root_group); root_group = NULL; return ret; } @@ -1926,6 +2086,11 @@ static int __init trace_events_user_init(void) if (dyn_event_register(&user_event_dops)) pr_warn("user_events could not register with dyn_events\n"); +#ifdef CONFIG_TRACE_NAMESPACE + if (trace_namespace_register(&user_event_ns_ops)) + pr_warn("user_events could not register with namespaces\n"); +#endif + return 0; } From patchwork Thu Jul 28 23:52:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Beau Belgrave X-Patchwork-Id: 12931826 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4C716C04A68 for ; Thu, 28 Jul 2022 23:52:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233136AbiG1Xww (ORCPT ); Thu, 28 Jul 2022 19:52:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231617AbiG1Xwu 
	(ORCPT );
	Thu, 28 Jul 2022 19:52:50 -0400
Received: from linux.microsoft.com (linux.microsoft.com [13.77.154.182])
	by lindbergh.monkeyblade.net (Postfix) with ESMTP id ABB2071731;
	Thu, 28 Jul 2022 16:52:49 -0700 (PDT)
Received: from localhost.localdomain (unknown [76.135.27.191])
	by linux.microsoft.com (Postfix) with ESMTPSA id 1622920FE9AF;
	Thu, 28 Jul 2022 16:52:49 -0700 (PDT)
DKIM-Filter: OpenDKIM Filter v2.11.0 linux.microsoft.com 1622920FE9AF
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.microsoft.com;
	s=default; t=1659052369;
	bh=HaSDfOxma22GjtN+QHmKmuwb7fowkhGpnjJEeEWHD1I=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=IDSBV9uuzKRNHwoO6GlvQi96BtvNyXgAz4hmeQ3SoHWgTlhfTdVOS1Tcj3L40RsuS
	 IQlildg0Z9sasvdYcs1qs+LO4imtcTyDrfiiSg3vF8wdJGdQB/mmCY95ZE72uT/N9i
	 Pg5GTptIYrD96jUTKpbPcrY3PggMGZEadWDT4Wus=
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 6/7] tracing/user_events: Enable setting event limit
 within namespace
Date: Thu, 28 Jul 2022 16:52:40 -0700
Message-Id: <20220728235241.2249-7-beaub@linux.microsoft.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220728235241.2249-1-beaub@linux.microsoft.com>
References: <20220728235241.2249-1-beaub@linux.microsoft.com>
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-trace-devel@vger.kernel.org

When granting non-admin users the ability to register and write data to
user events, they should have a limit imposed. Using the namespace options
file, operators can change the limit on how many events are allowed to be
created.

There is also a new line in the user_events_status file to let users know
the current limit (and to ask the operator for more if required).

For example, to limit the namespace to only 256 events:
echo user_events_limit=256 > options

From within the namespace root:
cat user_events_status
...
Limit: 256

Signed-off-by: Beau Belgrave
---
 kernel/trace/trace_events_user.c | 57 +++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 9694eee27956..1dc88bbd04f9 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -41,6 +41,7 @@
 #define MAX_PAGES (1 << MAX_PAGE_ORDER)
 #define MAX_BYTES (MAX_PAGES * PAGE_SIZE)
 #define MAX_EVENTS (MAX_BYTES * 8)
+#define MAX_LIMIT (MAX_EVENTS - 1)
 
 /* Limit how long of an event name plus args within the subsystem. */
 #define MAX_EVENT_DESC 512
@@ -85,6 +86,7 @@ struct user_event_group {
 	DECLARE_BITMAP(page_bitmap, MAX_EVENTS);
 	refcount_t refcnt;
 	int id;
+	int reg_limit;
 };
 
 static DEFINE_HASHTABLE(group_table, 8);
@@ -252,6 +254,13 @@ static struct user_event_group *user_event_group_create(const char *name,
 		goto error;
 	}
 
+	/*
+	 * Register limit is based on available events:
+	 * The ABI states event 0 is reserved, so the real max is the amount
+	 * of bits in the bitmap minus 1 (the reserved event slot).
+	 */
+	group->reg_limit = MAX_LIMIT;
+
 	group->pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PAGE_ORDER);
 
 	if (!group->pages)
@@ -1276,8 +1285,7 @@ static int user_event_parse(struct user_event_group *group, char *name,
 			    char *args, char *flags,
 			    struct user_event **newuser)
 {
-	int ret;
-	int index;
+	int ret, index, limit;
 	u32 key;
 	struct user_event *user;
 
@@ -1296,9 +1304,16 @@ static int user_event_parse(struct user_event_group *group, char *name,
 		return 0;
 	}
 
-	index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS);
+	/*
+	 * 0 is a reserved bit, so the real limit needs to be one higher.
+	 * An example of this is a limit of 1, bit 0 is always set. To make
+	 * this work, the limit must be 2 in this case (bit 1 will be set).
+	 */
+	limit = min(group->reg_limit + 1, (int)MAX_EVENTS);
+
+	index = find_first_zero_bit(group->page_bitmap, limit);
 
-	if (index == MAX_EVENTS)
+	if (index == limit)
 		return -EMFILE;
 
 	user = kzalloc(sizeof(*user), GFP_KERNEL);
@@ -1831,6 +1846,7 @@ static int user_seq_show(struct seq_file *m, void *p)
 	seq_printf(m, "Active: %d\n", active);
 	seq_printf(m, "Busy: %d\n", busy);
 	seq_printf(m, "Max: %ld\n", MAX_EVENTS);
+	seq_printf(m, "Limit: %d\n", group->reg_limit);
 
 	return 0;
 }
@@ -2010,13 +2026,44 @@ static int user_event_ns_remove(struct trace_namespace *ns)
 	return ret;
 }
 
+#define NS_EVENT_LIMIT_PREFIX "user_events_limit="
+
 static int user_event_ns_parse(struct trace_namespace *ns, const char *command)
 {
-	return -ECANCELED;
+	struct user_event_group *group = user_event_group_find(ns->id);
+	int len, value, ret = -ECANCELED;
+
+	if (!group)
+		return -ECANCELED;
+
+	len = str_has_prefix(command, NS_EVENT_LIMIT_PREFIX);
+	if (len && !kstrtoint(command + len, 0, &value)) {
+		if (value <= 0 || value > MAX_LIMIT) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		group->reg_limit = value;
+		ret = 0;
+		goto out;
+	}
+out:
+	user_event_group_release(group);
+
+	return ret;
 }
 
 static int user_event_ns_show(struct trace_namespace *ns, struct seq_file *m)
 {
+	struct user_event_group *group = user_event_group_find(ns->id);
+
+	if (!group)
+		return 0;
+
+	seq_printf(m, "%s%d\n", NS_EVENT_LIMIT_PREFIX, group->reg_limit);
+
+	user_event_group_release(group);
+
 	return 0;
 }

From patchwork Thu Jul 28 23:52:41 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Beau Belgrave
X-Patchwork-Id: 12931828
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9D7CAC19F2B
	for ; Thu, 28 Jul 2022 23:52:55 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by
	vger.kernel.org via listexpand
	id S233247AbiG1Xwy (ORCPT );
	Thu, 28 Jul 2022 19:52:54 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35290 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S231925AbiG1Xwu (ORCPT );
	Thu, 28 Jul 2022 19:52:50 -0400
Received: from linux.microsoft.com (linux.microsoft.com [13.77.154.182])
	by lindbergh.monkeyblade.net (Postfix) with ESMTP id BA10272ECF;
	Thu, 28 Jul 2022 16:52:49 -0700 (PDT)
Received: from localhost.localdomain (unknown [76.135.27.191])
	by linux.microsoft.com (Postfix) with ESMTPSA id 47EC520FE9B1;
	Thu, 28 Jul 2022 16:52:49 -0700 (PDT)
DKIM-Filter: OpenDKIM Filter v2.11.0 linux.microsoft.com 47EC520FE9B1
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.microsoft.com;
	s=default; t=1659052369;
	bh=5eRwg6MlWo573EnBgxRRM6VTU+gEpz5q8Hd5iRDGIg8=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=cuSi9laXedSlBVsCx5ELVW8t2t4+PyF4tx4kgYARfCeMYRJmgknvB0VRD1XuEntk+
	 PlXqDY49fBlhGFDXqHja2dmL3Lcm+nIzS/9WSGlAspf0JWhxH3ZeDc+UNe/Ni4YAuI
	 QF/jycic2z6seY8Snc6VGYnfZcFZr4csqgN93ig8=
From: Beau Belgrave
To: rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com
Cc: linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 7/7] tracing/user_events: Add self-test for namespace
 integration
Date: Thu, 28 Jul 2022 16:52:41 -0700
Message-Id: <20220728235241.2249-8-beaub@linux.microsoft.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220728235241.2249-1-beaub@linux.microsoft.com>
References: <20220728235241.2249-1-beaub@linux.microsoft.com>
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-trace-devel@vger.kernel.org

Tests to ensure namespace cases are working correctly. Ensures that
namespaces work as before for status/write cases and validates removing a
namespace with open files, tracing enabled, etc.
Signed-off-by: Beau Belgrave
---
 .../selftests/user_events/ftrace_test.c       | 150 ++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c
index 404a2713dcae..5d384c1b31c4 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -22,6 +22,16 @@ const char *enable_file = "/sys/kernel/debug/tracing/events/user_events/__test_e
 const char *trace_file = "/sys/kernel/debug/tracing/trace";
 const char *fmt_file = "/sys/kernel/debug/tracing/events/user_events/__test_event/format";
 
+const char *namespace_dir = "/sys/kernel/debug/tracing/namespaces/self_test";
+const char *ns_data_file = "/sys/kernel/debug/tracing/namespaces/self_test/"
+	"root/user_events_data";
+const char *ns_status_file = "/sys/kernel/debug/tracing/namespaces/self_test/"
+	"root/user_events_status";
+const char *ns_enable_file = "/sys/kernel/debug/tracing/events/"
+	"user_events.self_test/__test_event/enable";
+const char *ns_options_file = "/sys/kernel/debug/tracing/namespaces/self_test/"
+	"options";
+
 static inline int status_check(char *status_page, int status_bit)
 {
 	return status_page[status_bit >> 3] & (1 << (status_bit & 7));
@@ -160,6 +170,53 @@ static int check_print_fmt(const char *event, const char *expected)
 	return strcmp(print_fmt, expected);
 }
 
+FIXTURE(ns) {
+	int status_fd;
+	int data_fd;
+	int enable_fd;
+	int options_fd;
+};
+
+FIXTURE_SETUP(ns) {
+	if (mkdir(namespace_dir, 0770)) {
+		ASSERT_EQ(EEXIST, errno);
+	}
+
+	self->status_fd = open(ns_status_file, O_RDONLY);
+	ASSERT_NE(-1, self->status_fd);
+
+	self->data_fd = open(ns_data_file, O_RDWR);
+	ASSERT_NE(-1, self->data_fd);
+
+	self->options_fd = open(ns_options_file, O_RDWR);
+	ASSERT_NE(-1, self->options_fd);
+
+	self->enable_fd = -1;
+}
+
+FIXTURE_TEARDOWN(ns) {
+	if (self->status_fd != -1)
+		close(self->status_fd);
+
+	if (self->data_fd != -1)
+		close(self->data_fd);
+
+	if (self->options_fd != -1)
+		close(self->options_fd);
+
+	if (self->enable_fd != -1) {
+		write(self->enable_fd, "0", sizeof("0"));
+		close(self->enable_fd);
+		self->enable_fd = -1;
+	}
+
+	ASSERT_EQ(0, clear());
+
+	if (rmdir(namespace_dir)) {
+		ASSERT_EQ(ENOENT, errno);
+	}
+}
+
 FIXTURE(user) {
 	int status_fd;
 	int data_fd;
@@ -477,6 +534,99 @@ TEST_F(user, print_fmt) {
 	ASSERT_EQ(0, ret);
 }
 
+TEST_F(ns, namespaces) {
+	struct user_reg reg = {0};
+	struct iovec io[3];
+	__u32 field1, field2;
+	int before = 0, after = 0;
+	int page_size = sysconf(_SC_PAGESIZE);
+	char *status_page;
+
+	reg.size = sizeof(reg);
+	reg.name_args = (__u64)"__test_event u32 field1; u32 field2";
+
+	field1 = 1;
+	field2 = 2;
+
+	io[0].iov_base = &reg.write_index;
+	io[0].iov_len = sizeof(reg.write_index);
+	io[1].iov_base = &field1;
+	io[1].iov_len = sizeof(field1);
+	io[2].iov_base = &field2;
+	io[2].iov_len = sizeof(field2);
+
+	/* Limit to 1 event */
+	ASSERT_NE(-1, write(self->options_fd,
+			    "user_events_limit=1\n",
+			    sizeof("user_events_limit=1\n") - 1));
+
+	/* Register should work */
+	ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
+	ASSERT_EQ(0, reg.write_index);
+	ASSERT_NE(0, reg.status_bit);
+
+	status_page = mmap(NULL, page_size, PROT_READ, MAP_SHARED,
+			   self->status_fd, 0);
+
+	/* MMAP should work and be zero'd */
+	ASSERT_NE(MAP_FAILED, status_page);
+	ASSERT_NE(NULL, status_page);
+	ASSERT_EQ(0, status_check(status_page, reg.status_bit));
+
+	/* Enable event (start tracing) */
+	self->enable_fd = open(ns_enable_file, O_RDWR);
+	ASSERT_NE(-1, write(self->enable_fd, "1", sizeof("1")));
+
+	/* Event should now be enabled */
+	ASSERT_NE(0, status_check(status_page, reg.status_bit));
+
+	/* Write should make it out to ftrace buffers */
+	before = trace_bytes();
+	ASSERT_NE(-1, writev(self->data_fd, (const struct iovec *)io, 3));
+	after = trace_bytes();
+	ASSERT_GT(after, before);
+
+	/* Register above limit should fail */
+	reg.name_args = (__u64)"__test_event_nope u32 field1; u32 field2";
+	ASSERT_EQ(-1, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
+	ASSERT_EQ(EMFILE, errno);
+
+	/* Removing namespace while files open should fail */
+	ASSERT_EQ(-1, rmdir(namespace_dir));
+
+	close(self->options_fd);
+	self->options_fd = -1;
+
+	/* Removing namespace while files open should fail */
+	ASSERT_EQ(-1, rmdir(namespace_dir));
+
+	close(self->status_fd);
+	self->status_fd = -1;
+
+	/* Removing namespace while files open should fail */
+	ASSERT_EQ(-1, rmdir(namespace_dir));
+
+	close(self->data_fd);
+	self->data_fd = -1;
+
+	/* Removing namespace while mmaps are open should fail */
+	ASSERT_EQ(-1, rmdir(namespace_dir));
+
+	/* Unmap */
+	ASSERT_EQ(0, munmap(status_page, page_size));
+
+	/* Removing namespace with no files open but tracing enabled should fail */
+	ASSERT_EQ(-1, rmdir(namespace_dir));
+
+	/* Disable event (stop tracing) */
+	ASSERT_NE(-1, write(self->enable_fd, "0", sizeof("0")));
+	close(self->enable_fd);
+	self->enable_fd = -1;
+
+	/* Removing namespace should now work */
+	ASSERT_EQ(0, rmdir(namespace_dir));
+}
+
 int main(int argc, char **argv)
 {
 	return test_harness_run(argc, argv);