Message ID | 20211005224428.2551-1-beaub@linux.microsoft.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | user_events: Enable user processes to create and write to trace events | expand |
Hello Beau, On Tue, 5 Oct 2021 15:44:28 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > User mode processes that wish to use trace events to get data into > ftrace, perf, eBPF, etc are limited to uprobes today. The user events > features enables an ABI for user mode processes to create and write to > trace events that are isolated from kernel level trace events. This > enables a faster path for tracing from user mode data as well as opens > managed code to participate in trace events, where stub locations are > dynamic. Thanks for your work! This looks very good candidate of the user static events instead of using uprobe + sdt-marker. And sorry I need more time to review your code, so this time I would like to discuss its interface design. > User processes often want to trace only when it's useful. To enable this > a set of pages are mapped into the user process space that indicate the > current state of the user events that have been registered. User > processes can check if their event is hooked to a trace/probe, and if it > is, emit the event data out via the write() syscall. > > Two files are added to tracefs to accomplish this: > user_events_mmap - This file is mmap'd into participating user mode > processes to indicate event status. It seems like the "method" is used for the file name of user_events_mmap, instead of what is the purpose of the file. What about "user_events_status" and show the informations (which program made events and what tracer is using the events) when you read the file? And when you mmap it, or when you ioctl to get mmapable fd, your program can directly monitor the flags. > > user_events_data - This file is opened and register/delete ioctl's are > issued to create/open/delete trace events that can be used for tracing. > > The typical scenario is on process start to mmap user_events_mmap. Processes > then register the events they plan to use via the REG ioctl. The return value > of the ioctl indicates the byte in the mmap to use for status. The file that > was used for the ioctl is now accepting data via write() to emit out into the > trace event. > > Psuedo code example of typical usage: > page_fd = open("user_events_mmap", O_RDWR); > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > data_fd = open("user_events_data", O_RDWR); > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); Hmm, if the data_id is same as the ID in events/*/*/format, it will be queried by libtraceevent or libftrace, isn't it? And also, if you can define the user-event via dynamic_event interface, it should be used instead of using ioctl on data channel. > > if (page_data[data_id]) write(data_fd, &payload, sizeof(payload)); > > User events are also exposed via the dynamic_events tracefs file for > both create, delete and current status. > > Simple example to register a user event via dynamic_events and get status: > echo ue:test >> dynamic_events > cat dynamic_events > ue:test I think you can use only 'u' instead of 'ue' for event command prefix. (Uprobes and kprobes shares 'p' and 'r', those are identified by the place specification format) > If an event is hooked to a probe, the probe hooked shows up: > echo 1 > events/user_events/test/enable > cat dynamic_events > ue:test (Used by ftrace) No, please don't show such "used" flag in definition file, since "(Used by *)" is not parsed when we write it. The dynamic_events is only used for defining events, not showing each event status. > > Users can describe the trace event format via the following format: > name[;field1;field2] > > Each field has the following format: > type\tname\tsize\toffset I'm not sure why don't you use the same syntax of synth events. you don't need the offset. > > Example for char array with a size of 20 named msg: > echo -e 'ue:detailed;char[]\tmsg\t20\t0' >> dynamic_events > cat dynamic_events > ue:detailed;char[] msg Oh, I thought that the each user events should not be accessible from other processes. And I got the reason, if you run different processes (instances) at the same time, those may need to share same events. In summary, I would like to suggest following interface to split the control, monitor and data channels. - dynamic_events : Add new 'u' command but derived from synth event (or share the parser) - user_event_status : monitor status by mmap, and showing human-readable status by read. query offset by ioctl. - user_event_data : just write data (binary packet). User written binary packet will be based on the event format, for example, echo "u:foo u64 mydata; char[] msg" /sys/kernel/debug/tracing # cat events/user_events/foo/format name: foo ID: 1234 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:u64 mydata; offset:8; size:8; signed:0; field:__data_loc char[] msg; offset:16; size:4; signed:1; Then, the binary should be formatted as; [u32 ID][u64 mydata][u16 msg-len][u16 msg-offs][msg-data] Then user will generate a packet like; [1234][0xdeadbeef][12][16]["Hello world"] And write it to the user_event_data file. Kernel will verify len and offs if there is a dynamic array (string), and record common fields, update msg-offs (because it will be shift the size of common fields) and record user-data. I also think we need a user-mode library for the above interface so that user can easily bind to each application and language. What would you think? Thank you, > > Data offsets are based on the data written out via write() and will be > updated to reflect the correct offset in the trace_event fields. __data_loc > types must be aware of the size of trace_entry/common properties to ensure > proper decoding. An ioctl is provided that enables user mode processes that > only have access to user_events_data that returns the correct offset to use > within the data payload (nothing needs to be done on registration). > > The above format is valid for both the ioctl and the dynamic_events file. > > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> > --- > kernel/trace/Kconfig | 15 + > kernel/trace/Makefile | 1 + > kernel/trace/trace_events_user.c | 845 +++++++++++++++++++++++++++++++ > 3 files changed, 861 insertions(+) > create mode 100644 kernel/trace/trace_events_user.c > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > index 3ee23f4d437f..deaaad421be4 100644 > --- a/kernel/trace/Kconfig > +++ b/kernel/trace/Kconfig > @@ -725,6 +725,21 @@ config SYNTH_EVENTS > > If in doubt, say N. > > +config USER_EVENTS > + bool "User trace events" > + select TRACING > + select DYNAMIC_EVENTS > + default n > + help > + User trace events are user-defined trace events that > + can be used like an existing kernel trace event. User trace > + events are generated by writing to a tracefs file. User > + processes can determine if their tracing events should be > + generated by memory mapping a tracefs file and checking for > + an associated byte being non-zero. > + > + If in doubt, say N. > + > config HIST_TRIGGERS > bool "Histogram triggers" > depends on ARCH_HAVE_NMI_SAFE_CMPXCHG > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile > index b1c47ccf4f73..a653b255e89c 100644 > --- a/kernel/trace/Makefile > +++ b/kernel/trace/Makefile > @@ -80,6 +80,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o > obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o > obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o > obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o > +obj-$(CONFIG_USER_EVENTS) += trace_events_user.o > obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o > obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o > obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o > diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c > new file mode 100644 > index 000000000000..9afa72b55fa8 > --- /dev/null > +++ b/kernel/trace/trace_events_user.c > @@ -0,0 +1,845 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (c) 2021, Microsoft Corporation. > + * > + * Authors: > + * Beau Belgrave <beaub@linux.microsoft.com> > + */ > + > +#include <linux/bitmap.h> > +#include <linux/cdev.h> > +#include <linux/hashtable.h> > +#include <linux/io.h> > +#include <linux/ioctl.h> > +#include <linux/jhash.h> > +#include <linux/trace_events.h> > +#include <linux/tracefs.h> > +#include <linux/types.h> > +#include <linux/uaccess.h> > +#include "trace.h" > +#include "trace_dynevent.h" > + > +#define USER_EVENTS_SYSTEM "user_events" > +#define USER_EVENTS_PREFIX "ue:" > +#define USER_EVENTS_PREFIX_LEN (sizeof(USER_EVENTS_PREFIX)-1) > + > +/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */ > +#define EVENT_BIT_FTRACE 0 > +#define EVENT_BIT_PERF 1 > +#define EVENT_BIT_OTHER 7 > + > +#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE) > +#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF) > +#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER) > + > +#define FIELD_DEPTH_TYPE 0 > +#define FIELD_DEPTH_NAME 1 > +#define FIELD_DEPTH_SIZE 2 > +#define FIELD_DEPTH_OFFSET 3 > + > +/* > + * Limits how many trace_event calls user processes can create: > + * Must be multiple of PAGE_SIZE. > + */ > +#define MAX_PAGES 1 > +#define MAX_EVENTS (MAX_PAGES * PAGE_SIZE) > + > +/* Limit how long of an event name plus args within the subsystem. */ > +#define MAX_EVENT_DESC 512 > +#define EVENT_NAME(user_event) ((user_event)->tracepoint.name) > + > +#define DIAG_IOC_MAGIC '*' > +#define DIAG_IOCSREG _IOW(DIAG_IOC_MAGIC, 0, char*) > +#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*) > +#define DIAG_IOCQLOCOFFSET _IO(DIAG_IOC_MAGIC, 2) > + > +static char *register_page_data; > + > +static DEFINE_HASHTABLE(register_table, 4); > +static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); > + > +struct user_event { > + struct tracepoint tracepoint; > + struct trace_event_call call; > + struct trace_event_class class; > + struct dyn_event devent; > + struct hlist_node node; > + atomic_t refs; > + int index; > + char *args; > +}; > + > +#ifdef CONFIG_PERF_EVENTS > +struct user_bpf_context { > + int udatalen; > + const char __user *udata; > +}; > +#endif > + > +typedef void (*user_event_func_t) (struct user_event *user, > + const char __user *udata, > + size_t udatalen, void *tpdata); > + > +static int register_user_event(char *name, char *args, > + struct user_event **newuser); > + > +/* > + * Parses a register command for user_events > + * Format: event_name[;field1;field2;...] > + * > + * Example event named test with a 20 char msg field at offset 0 with a unsigned int at offset 20: > + * test;char[]\tmsg\t20\t0;unsigned int\tid\t4\t20; > + * > + * NOTE: Offsets are from the user data perspective, they are not from the > + * trace_entry/buffer perspective. We automatically add the common properties > + * sizes to the offset for the user. Types of __data_loc must trace a value that > + * is offset by the value of the DIAG_IOCQLOCOFFSET ioctl to decode properly. > + * This makes it easy for the common cases via the terminal, as only __data_loc > + * types require an awareness by the user of the common property offsets. > + */ > +static int user_event_parse_cmd(char *raw_command, struct user_event **newuser) > +{ > + char *name = raw_command; > + char *args = strpbrk(name, ";"); > + > + if (args) > + *args++ = 0; > + > + return register_user_event(name, args, newuser); > +} > + > +/* > + * Parses the values of a field within the description > + * Format: type\tname\tsize\toffset\t[future additions\t] > + */ > +static int user_event_parse_field(char *field, struct user_event *user) > +{ > + char *part, *type, *name; > + u32 size, offset; > + int depth = 0; > + > + while ((part = strsep(&field, "\t")) != NULL) { > + switch (depth++) { > + case FIELD_DEPTH_TYPE: > + type = part; > + break; > + case FIELD_DEPTH_NAME: > + name = part; > + break; > + case FIELD_DEPTH_SIZE: > + if (kstrtou32(part, 10, &size)) > + return -EINVAL; > + break; > + case FIELD_DEPTH_OFFSET: > + if (kstrtou32(part, 10, &offset)) > + return -EINVAL; > + /* > + * User does not know what trace_entry size is > + * so we have to add to the offset. For data loc > + * scenarios, user mode applications must be aware > + * of this size when emitting the data location. > + * The DIAG_IOCQLOCOFFSET ioctl can be used to get this. > + */ > + offset += sizeof(struct trace_entry); > + break; > + default: > + /* Forward compatibility, ignore */ > + goto end; > + } > + } > +end: > + if (depth < FIELD_DEPTH_OFFSET) > + return -EINVAL; > + > + if (!strcmp(type, "print_fmt")) { > + user->call.print_fmt = name; > + return 0; > + } > + > + return trace_define_field(&user->call, type, name, offset, size, > + type[0] != 'u', FILTER_OTHER); > +} > + > +/* > + * Parses the fields that were described for the event > + */ > +static int user_event_parse_fields(struct user_event *user) > +{ > + char *field; > + int ret = -EINVAL; > + > + while ((field = strsep(&user->args, ";")) != NULL) { > + ret = user_event_parse_field(field, user); > + > + if (ret) > + break; > + } > + > + return ret; > +} > + > +static int user_event_define_fields(struct trace_event_call *call) > +{ > + struct user_event *user = (struct user_event *)call->data; > + > + /* User chose to not disclose arguments */ > + if (user->args == NULL) > + return 0; > + > + return user_event_parse_fields(user); > +} > + > +static struct trace_event_fields user_event_fields_array[] = { > + { .type = TRACE_FUNCTION_TYPE, > + .define_fields = user_event_define_fields }, > + {} > +}; > + > +static enum print_line_t user_event_print_trace(struct trace_iterator *iter, > + int flags, > + struct trace_event *event) > +{ > + /* Unsafe to try to decode user provided print_fmt, use hex */ > + trace_print_hex_dump_seq(&iter->seq, "", DUMP_PREFIX_OFFSET, 16, > + 1, iter->ent, iter->ent_size, true); > + > + return trace_handle_return(&iter->seq); > +} > + > +static struct trace_event_functions user_event_funcs = { > + .trace = user_event_print_trace, > +}; > + > +static int destroy_user_event(struct user_event *user) > +{ > + int ret = 0; > + > + /* > + * trace_remove_event_call invokes unregister_trace_event: > + * Pick the correct one based on if we set the data or not > + */ > + if (user->index != 0) { > + ret = trace_remove_event_call(&user->call); > + > + if (ret) > + return ret; > + > + dyn_event_remove(&user->devent); > + > + register_page_data[user->index] = 0; > + clear_bit(user->index, page_bitmap); > + hash_del(&user->node); > + } else { > + unregister_trace_event(&user->call.event); > + } > + > + kfree(EVENT_NAME(user)); > + kfree(user); > + > + return ret; > +} > + > +static struct user_event *find_user_event(u32 key, char *name) > +{ > + struct user_event *user; > + > + hash_for_each_possible(register_table, user, node, key) > + if (!strcmp(EVENT_NAME(user), name)) > + return user; > + > + return NULL; > +} > + > +/* > + * Writes the user supplied payload out to a trace file. > + */ > +static void user_event_ftrace(struct user_event *user, const char __user *udata, > + size_t udatalen, void *tpdata) > +{ > + struct trace_event_file *file; > + struct trace_entry *entry; > + struct trace_event_buffer event_buffer; > + > + file = (struct trace_event_file *)tpdata; > + > + if (!file || > + !(file->flags & EVENT_FILE_FL_ENABLED) || > + trace_trigger_soft_disabled(file)) > + return; > + > + entry = trace_event_buffer_reserve(&event_buffer, file, > + sizeof(*entry) + udatalen); > + > + if (!entry) > + return; > + > + if (!copy_from_user(entry + 1, udata, udatalen)) > + trace_event_buffer_commit(&event_buffer); > +} > + > +#ifdef CONFIG_PERF_EVENTS > +/* > + * Writes the user supplied payload out to perf ring buffer or eBPF program. > + */ > +static void user_event_perf(struct user_event *user, const char __user *udata, > + size_t udatalen, void *tpdata) > +{ > + struct hlist_head *perf_head; > + > + if (bpf_prog_array_valid(&user->call)) { > + struct user_bpf_context context = {0}; > + > + context.udatalen = udatalen; > + context.udata = udata; > + > + trace_call_bpf(&user->call, &context); > + } > + > + perf_head = this_cpu_ptr(user->call.perf_events); > + > + if (perf_head && !hlist_empty(perf_head)) { > + struct trace_entry *perf_entry; > + struct pt_regs *regs; > + size_t size = sizeof(*perf_entry) + udatalen; > + int context; > + > + perf_entry = perf_trace_buf_alloc(ALIGN(size, 8), > + ®s, &context); > + > + if (!perf_entry) > + return; > + > + perf_fetch_caller_regs(regs); > + > + if (copy_from_user(perf_entry + 1, > + udata, > + udatalen)) > + return; > + > + perf_trace_buf_submit(perf_entry, size, context, > + user->call.event.type, 1, regs, > + perf_head, NULL); > + } > +} > +#endif > + > +/* > + * Update the register page that is shared between user processes. > + */ > +static void update_reg_page_for(struct user_event *user) > +{ > + struct tracepoint *tp = &user->tracepoint; > + char status = 0; > + > + if (atomic_read(&tp->key.enabled) > 0) { > + struct tracepoint_func *probe_func_ptr; > + user_event_func_t probe_func; > + > + probe_func_ptr = rcu_dereference_sched(tp->funcs); > + > + if (probe_func_ptr) { > + do { > + probe_func = probe_func_ptr->func; > + > + if (probe_func == user_event_ftrace) > + status |= EVENT_STATUS_FTRACE; > +#ifdef CONFIG_PERF_EVENTS > + else if (probe_func == user_event_perf) > + status |= EVENT_STATUS_PERF; > +#endif > + else > + status |= EVENT_STATUS_OTHER; > + } while ((++probe_func_ptr)->func); > + } > + } > + > + register_page_data[user->index] = status; > +} > + > +/* > + * Register callback for our events from tracing sub-systems. > + */ > +static int user_event_reg(struct trace_event_call *call, > + enum trace_reg type, > + void *data) > +{ > + struct user_event *user = (struct user_event *)call->data; > + int ret = 0; > + > + if (!user) > + return -ENOENT; > + > + switch (type) { > + case TRACE_REG_REGISTER: > + ret = tracepoint_probe_register(call->tp, > + call->class->probe, > + data); > + if (!ret) > + goto inc; > + break; > + > + case TRACE_REG_UNREGISTER: > + tracepoint_probe_unregister(call->tp, > + call->class->probe, > + data); > + goto dec; > + > +#ifdef CONFIG_PERF_EVENTS > + case TRACE_REG_PERF_REGISTER: > + ret = tracepoint_probe_register(call->tp, > + call->class->perf_probe, > + data); > + if (!ret) > + goto inc; > + break; > + > + case TRACE_REG_PERF_UNREGISTER: > + tracepoint_probe_unregister(call->tp, > + call->class->perf_probe, > + data); > + goto dec; > + > + case TRACE_REG_PERF_OPEN: > + case TRACE_REG_PERF_CLOSE: > + case TRACE_REG_PERF_ADD: > + case TRACE_REG_PERF_DEL: > + break; > +#endif > + } > + > + return ret; > +inc: > + atomic_inc(&user->refs); > + update_reg_page_for(user); > + return 0; > +dec: > + update_reg_page_for(user); > + atomic_dec(&user->refs); > + return 0; > +} > + > +static u32 user_event_key(char *name) > +{ > + return jhash(name, strlen(name), 0); > +} > + > +static int user_event_create(const char *raw_command) > +{ > + struct user_event *user; > + char *name; > + int ret; > + > + if (strstr(raw_command, USER_EVENTS_PREFIX) != raw_command) > + return -ECANCELED; > + > + name = kstrdup(raw_command + USER_EVENTS_PREFIX_LEN, GFP_KERNEL); > + > + if (!name) > + return -ENOMEM; > + > + mutex_lock(&event_mutex); > + ret = user_event_parse_cmd(name, &user); > + mutex_unlock(&event_mutex); > + > + return ret; > +} > + > +static int user_event_show(struct seq_file *m, struct dyn_event *ev) > +{ > + struct user_event *user = container_of(ev, struct user_event, devent); > + struct ftrace_event_field *field, *next; > + struct list_head *head; > + char status; > + > + seq_printf(m, "%s%s", USER_EVENTS_PREFIX, EVENT_NAME(user)); > + > + head = trace_get_fields(&user->call); > + > + list_for_each_entry_safe(field, next, head, link) > + seq_printf(m, ";%s %s", field->type, field->name); > + > + status = register_page_data[user->index]; > + > + if (status != 0) { > + seq_puts(m, " (Used by"); > + if (status & EVENT_STATUS_FTRACE) > + seq_puts(m, " ftrace"); > + if (status & EVENT_STATUS_PERF) > + seq_puts(m, " perf"); > + if (status & EVENT_STATUS_OTHER) > + seq_puts(m, " other"); > + seq_puts(m, ")"); > + } > + > + seq_puts(m, "\n"); > + > + return 0; > +} > + > +static bool user_event_is_busy(struct dyn_event *ev) > +{ > + struct user_event *user = container_of(ev, struct user_event, devent); > + > + return atomic_read(&user->refs) != 0; > +} > + > +static int user_event_free(struct dyn_event *ev) > +{ > + struct user_event *user = container_of(ev, struct user_event, devent); > + > + return destroy_user_event(user); > +} > + > +static bool user_event_match(const char *system, const char *event, > + int argc, const char **argv, struct dyn_event *ev) > +{ > + struct user_event *user = container_of(ev, struct user_event, devent); > + > + return strcmp(EVENT_NAME(user), event) == 0 && > + (!system || strcmp(system, USER_EVENTS_SYSTEM) == 0); > +} > + > +static struct dyn_event_operations user_event_dops = { > + .create = user_event_create, > + .show = user_event_show, > + .is_busy = user_event_is_busy, > + .free = user_event_free, > + .match = user_event_match, > +}; > + > +/* > + * Register a trace_event into the system, either find or create. > + */ > +static int register_user_event(char *name, char *args, > + struct user_event **newuser) > +{ > + int ret; > + int index; > + u32 key = user_event_key(name); > + struct user_event *user = find_user_event(key, name); > + > + if (user) { > + *newuser = user; > + ret = 0; > + goto put_name; > + } > + > + index = find_first_zero_bit(page_bitmap, MAX_EVENTS); > + > + if (index == MAX_EVENTS) { > + ret = -EMFILE; > + goto put_name; > + } > + > + user = kzalloc(sizeof(*user), GFP_KERNEL); > + > + if (!user) { > + ret = -ENOMEM; > + goto put_name; > + } > + > + INIT_LIST_HEAD(&user->class.fields); > + > + user->tracepoint.name = name; > + user->args = args; > + > + user->call.data = user; > + user->call.class = &user->class; > + user->call.name = name; > + user->call.flags = TRACE_EVENT_FL_TRACEPOINT; > + user->call.tp = &user->tracepoint; > + user->call.event.funcs = &user_event_funcs; > + > + user->class.system = USER_EVENTS_SYSTEM; > + user->class.fields_array = user_event_fields_array; > + user->class.reg = user_event_reg; > + user->class.probe = user_event_ftrace; > +#ifdef CONFIG_PERF_EVENTS > + user->class.perf_probe = user_event_perf; > +#endif > + > + ret = register_trace_event(&user->call.event); > + > + if (!ret) { > + ret = -ENODEV; > + goto put_user; > + } > + > + ret = trace_add_event_call(&user->call); > + > + if (ret) { > + destroy_user_event(user); > + goto out; > + } > + > + user->index = index; > + dyn_event_init(&user->devent, &user_event_dops); > + dyn_event_add(&user->devent); > + set_bit(user->index, page_bitmap); > + hash_add(register_table, &user->node, key); > + > + *newuser = user; > + return 0; > +put_user: > + kfree(user); > +put_name: > + kfree(name); > +out: > + return ret; > +} > + > +/* > + * Deletes a previously created event if it is no longer being used. > + */ > +static int delete_user_event(char *name) > +{ > + u32 key = user_event_key(name); > + struct user_event *user = find_user_event(key, name); > + > + if (!user) > + return -ENOENT; > + > + if (atomic_read(&user->refs) != 0) > + return -EBUSY; > + > + return destroy_user_event(user); > +} > + > +/* > + * Validates the user payload and writes to the appropriate sub-system. > + */ > +static ssize_t user_events_write(struct file *file, const char __user *ubuf, > + size_t count, loff_t *ppos) > +{ > + struct user_event *user; > + struct tracepoint *tp; > + > + if (*ppos != 0 || count <= 0) > + return -EFAULT; > + > + user = file->private_data; > + > + if (!user) > + return -ENOENT; > + > + tp = &user->tracepoint; > + > + if (likely(atomic_read(&tp->key.enabled) > 0)) { > + struct tracepoint_func *probe_func_ptr; > + user_event_func_t probe_func; > + void *tpdata; > + > + preempt_disable(); > + > + if (unlikely(!(cpu_online(raw_smp_processor_id())))) > + goto preempt_out; > + > + probe_func_ptr = rcu_dereference_sched(tp->funcs); > + > + if (probe_func_ptr) { > + do { > + probe_func = probe_func_ptr->func; > + tpdata = probe_func_ptr->data; > + probe_func(user, ubuf, count, tpdata); > + } while ((++probe_func_ptr)->func); > + } > +preempt_out: > + preempt_enable(); > + } > + > + return count; > +} > + > +/* > + * Handles the ioctl from user mode to register or alter operations. > + */ > +static long user_events_ioctl(struct file *file, unsigned int cmd, > + unsigned long uarg) > +{ > + void __user *ubuf = (void __user *)uarg; > + struct user_event *user; > + char *name; > + long ret; > + > + switch (cmd) { > + case DIAG_IOCSREG: > + /* Register/lookup on behalf of user process */ > + name = strndup_user(ubuf, MAX_EVENT_DESC); > + > + if (IS_ERR(name)) { > + ret = PTR_ERR(name); > + goto out; > + } > + > + mutex_lock(&event_mutex); > + > + if (file->private_data) { > + /* Already associated with an event */ > + ret = -EMFILE; > + kfree(name); > + goto reg_out; > + } > + > + ret = user_event_parse_cmd(name, &user); > + > + if (!ret) { > + file->private_data = user; > + atomic_inc(&user->refs); > + } > +reg_out: > + mutex_unlock(&event_mutex); > + > + if (ret < 0) > + goto out; > + > + /* Return page index to check before writes */ > + ret = user->index; > + break; > + > + case DIAG_IOCSDEL: > + /* Delete on behalf of user process */ > + name = strndup_user(ubuf, MAX_EVENT_DESC); > + > + if (IS_ERR(name)) { > + ret = PTR_ERR(name); > + goto out; > + } > + > + mutex_lock(&event_mutex); > + ret = delete_user_event(name); > + mutex_unlock(&event_mutex); > + > + kfree(name); > + break; > + > + case DIAG_IOCQLOCOFFSET: > + /* > + * Return data offset to use for data locs. This enables > + * user mode processes to query the common property sizes. > + * If this was not known, the data location values written > + * would be incorrect from the user mode side. > + */ > + ret = sizeof(struct trace_entry); > + break; > + > + default: > + ret = -ENOTTY; > + break; > + } > +out: > + return ret; > +} > + > +/* > + * Handles the final close of the file from user mode. > + */ > +static int user_events_release(struct inode *node, struct file *file) > +{ > + struct user_event *user = file->private_data; > + > + if (user) > + atomic_dec(&user->refs); > + > + return 0; > +} > + > +static const struct file_operations user_events_data_fops = { > + .write = user_events_write, > + .unlocked_ioctl = user_events_ioctl, > + .release = user_events_release, > +}; > + > +/* > + * Maps the shared page into the user process for checking if event is enabled. > + */ > +static int user_events_mmap(struct file *filp, struct vm_area_struct *vma) > +{ > + unsigned long size = vma->vm_end - vma->vm_start; > + > + if (size != MAX_EVENTS) > + return -EINVAL; > + > + return remap_pfn_range(vma, vma->vm_start, > + virt_to_phys(register_page_data) >> PAGE_SHIFT, > + size, PAGE_READONLY); > +} > + > +static const struct file_operations user_events_mmap_fops = { > + .mmap = user_events_mmap, > +}; > + > +/* > + * Creates a set of tracefs files to allow user mode interactions. > + */ > +static int create_user_tracefs(void) > +{ > + struct dentry *edata, *emmap; > + > + edata = tracefs_create_file("user_events_data", 0644, NULL, > + NULL, &user_events_data_fops); > + > + if (!edata) { > + pr_warn("Could not create tracefs 'user_events_data' entry\n"); > + goto err; > + } > + > + /* mmap with MAP_SHARED requires writable fd */ > + emmap = tracefs_create_file("user_events_mmap", 0644, NULL, > + NULL, &user_events_mmap_fops); > + > + if (!emmap) { > + tracefs_remove(edata); > + pr_warn("Could not create tracefs 'user_events_mmap' entry\n"); > + goto err; > + } > + > + return 0; > +err: > + return -ENODEV; > +} > + > +static void set_page_reservations(bool set) > +{ > + int page; > + > + for (page = 0; page < MAX_PAGES; ++page) { > + void *addr = register_page_data + (PAGE_SIZE * page); > + > + if (set) > + SetPageReserved(virt_to_page(addr)); > + else > + ClearPageReserved(virt_to_page(addr)); > + } > +} > + > +static int __init trace_events_user_init(void) > +{ > + int ret; > + > + /* Zero all bits beside 0 (which is reserved for failures) */ > + bitmap_zero(page_bitmap, MAX_EVENTS); > + set_bit(0, page_bitmap); > + > + register_page_data = kmalloc(MAX_EVENTS, GFP_KERNEL); > + > + if (!register_page_data) > + return -ENOMEM; > + > + set_page_reservations(true); > + > + ret = create_user_tracefs(); > + > + if (ret) { > + pr_warn("user_events could not register with tracefs\n"); > + set_page_reservations(false); > + kfree(register_page_data); > + return ret; > + } > + > + if (dyn_event_register(&user_event_dops)) > + pr_warn("user_events could not register with dyn_events\n"); > + > + return 0; > +} > + > +fs_initcall(trace_events_user_init); > -- > 2.17.1 >
On Tue, 5 Oct 2021 15:44:28 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > User mode processes that wish to use trace events to get data into > ftrace, perf, eBPF, etc are limited to uprobes today. The user events > features enables an ABI for user mode processes to create and write to > trace events that are isolated from kernel level trace events. This > enables a faster path for tracing from user mode data as well as opens > managed code to participate in trace events, where stub locations are > dynamic. > > User processes often want to trace only when it's useful. To enable this > a set of pages are mapped into the user process space that indicate the > current state of the user events that have been registered. User > processes can check if their event is hooked to a trace/probe, and if it > is, emit the event data out via the write() syscall. > > Two files are added to tracefs to accomplish this: > user_events_mmap - This file is mmap'd into participating user mode > processes to indicate event status. > > user_events_data - This file is opened and register/delete ioctl's are > issued to create/open/delete trace events that can be used for tracing. > > The typical scenario is on process start to mmap user_events_mmap. Processes > then register the events they plan to use via the REG ioctl. The return value > of the ioctl indicates the byte in the mmap to use for status. The file that > was used for the ioctl is now accepting data via write() to emit out into the > trace event. > > Psuedo code example of typical usage: > page_fd = open("user_events_mmap", O_RDWR); > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > data_fd = open("user_events_data", O_RDWR); > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > if (page_data[data_id]) write(data_fd, &payload, sizeof(payload)); What is the type of "page_data". I'd like to test it before accepting it. From playing around, I see that page_data is of type char *. > > User events are also exposed via the dynamic_events tracefs file for > both create, delete and current status. > > Simple example to register a user event via dynamic_events and get status: > echo ue:test >> dynamic_events > cat dynamic_events > ue:test > > If an event is hooked to a probe, the probe hooked shows up: > echo 1 > events/user_events/test/enable > cat dynamic_events > ue:test (Used by ftrace) > > Users can describe the trace event format via the following format: > name[;field1;field2] > > Each field has the following format: > type\tname\tsize\toffset BTW, the format should follow the way other probes are created. That is, having a space between the name and field, and not a semicolon. > > Example for char array with a size of 20 named msg: > echo -e 'ue:detailed;char[]\tmsg\t20\t0' >> dynamic_events > cat dynamic_events > ue:detailed;char[] msg > > Data offsets are based on the data written out via write() and will be > updated to reflect the correct offset in the trace_event fields. __data_loc > types must be aware of the size of trace_entry/common properties to ensure > proper decoding. An ioctl is provided that enables user mode processes that > only have access to user_events_data that returns the correct offset to use > within the data payload (nothing needs to be done on registration). > > The above format is valid for both the ioctl and the dynamic_events file. > > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> > --- > kernel/trace/Kconfig | 15 + > kernel/trace/Makefile | 1 + > kernel/trace/trace_events_user.c | 845 +++++++++++++++++++++++++++++++ > 3 files changed, 861 insertions(+) > create mode 100644 kernel/trace/trace_events_user.c > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > index 3ee23f4d437f..deaaad421be4 100644 > --- a/kernel/trace/Kconfig > +++ b/kernel/trace/Kconfig > @@ -725,6 +725,21 @@ config SYNTH_EVENTS > > If in doubt, say N. > > +config USER_EVENTS > + bool "User trace events" > + select TRACING > + select DYNAMIC_EVENTS > + default n > + help > + User trace events are user-defined trace events that > + can be used like an existing kernel trace event. User trace > + events are generated by writing to a tracefs file. User > + processes can determine if their tracing events should be > + generated by memory mapping a tracefs file and checking for > + an associated byte being non-zero. > + > + If in doubt, say N. > + > config HIST_TRIGGERS > bool "Histogram triggers" > depends on ARCH_HAVE_NMI_SAFE_CMPXCHG > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile > index b1c47ccf4f73..a653b255e89c 100644 > --- a/kernel/trace/Makefile > +++ b/kernel/trace/Makefile > @@ -80,6 +80,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o > obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o > obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o > obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o > +obj-$(CONFIG_USER_EVENTS) += trace_events_user.o > obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o > obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o > obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o > diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c > new file mode 100644 > index 000000000000..9afa72b55fa8 > --- /dev/null > +++ b/kernel/trace/trace_events_user.c > @@ -0,0 +1,845 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Copyright (c) 2021, Microsoft Corporation. > + * > + * Authors: > + * Beau Belgrave <beaub@linux.microsoft.com> > + */ > + > +#include <linux/bitmap.h> > +#include <linux/cdev.h> > +#include <linux/hashtable.h> > +#include <linux/io.h> > +#include <linux/ioctl.h> > +#include <linux/jhash.h> > +#include <linux/trace_events.h> > +#include <linux/tracefs.h> > +#include <linux/types.h> > +#include <linux/uaccess.h> > +#include "trace.h" > +#include "trace_dynevent.h" > + > +#define USER_EVENTS_SYSTEM "user_events" > +#define USER_EVENTS_PREFIX "ue:" > +#define USER_EVENTS_PREFIX_LEN (sizeof(USER_EVENTS_PREFIX)-1) > + > +/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */ > +#define EVENT_BIT_FTRACE 0 > +#define EVENT_BIT_PERF 1 > +#define EVENT_BIT_OTHER 7 > + > +#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE) > +#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF) > +#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER) > + > +#define FIELD_DEPTH_TYPE 0 > +#define FIELD_DEPTH_NAME 1 > +#define FIELD_DEPTH_SIZE 2 > +#define FIELD_DEPTH_OFFSET 3 > + > +/* > + * Limits how many trace_event calls user processes can create: > + * Must be multiple of PAGE_SIZE. > + */ > +#define MAX_PAGES 1 > +#define MAX_EVENTS (MAX_PAGES * PAGE_SIZE) > + > +/* Limit how long of an event name plus args within the subsystem. */ > +#define MAX_EVENT_DESC 512 > +#define EVENT_NAME(user_event) ((user_event)->tracepoint.name) > + > +#define DIAG_IOC_MAGIC '*' > +#define DIAG_IOCSREG _IOW(DIAG_IOC_MAGIC, 0, char*) > +#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*) > +#define DIAG_IOCQLOCOFFSET _IO(DIAG_IOC_MAGIC, 2) These obviously will need to go into a user abi header file. > + > +static char *register_page_data; > + > +static DEFINE_HASHTABLE(register_table, 4); > +static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); > + > +struct user_event { > + struct tracepoint tracepoint; > + struct trace_event_call call; > + struct trace_event_class class; > + struct dyn_event devent; > + struct hlist_node node; > + atomic_t refs; > + int index; > + char *args; > +}; > + > +#ifdef CONFIG_PERF_EVENTS > +struct user_bpf_context { > + int udatalen; > + const char __user *udata; > +}; > +#endif > + > +typedef void (*user_event_func_t) (struct user_event *user, > + const char __user *udata, > + size_t udatalen, void *tpdata); > + > +static int register_user_event(char *name, char *args, > + struct user_event **newuser); > + [..] > +static int __init trace_events_user_init(void) > +{ > + int ret; > + > + /* Zero all bits beside 0 (which is reserved for failures) */ > + bitmap_zero(page_bitmap, MAX_EVENTS); > + set_bit(0, page_bitmap); > + > + register_page_data = kmalloc(MAX_EVENTS, GFP_KERNEL); You want "kzalloc" here. Because when I read the map without adding anything, I get: printf("%lx\n", *(unsigned long *)page_data); Produces: ffffffff9065004e But if I convert it to kzalloc() it gives me: 0 Thus, you are exposing stale memory. If you want to expose this to non-admin users, this is a major security leak. -- Steve > + > + if (!register_page_data) > + return -ENOMEM; > + > + set_page_reservations(true); > + > + ret = create_user_tracefs(); > + > + if (ret) { > + pr_warn("user_events could not register with tracefs\n"); > + set_page_reservations(false); > + kfree(register_page_data); > + return ret; > + } > + > + if (dyn_event_register(&user_event_dops)) > + pr_warn("user_events could not register with dyn_events\n"); > + > + return 0; > +} > + > +fs_initcall(trace_events_user_init);
On Wed, Oct 06, 2021 at 12:54:41PM -0400, Steven Rostedt wrote: > > Psuedo code example of typical usage: > > page_fd = open("user_events_mmap", O_RDWR); > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > > > data_fd = open("user_events_data", O_RDWR); > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > > > if (page_data[data_id]) write(data_fd, &payload, sizeof(payload)); > > What is the type of "page_data". I'd like to test it before accepting it. > > From playing around, I see that page_data is of type char *. Yes, it is char *. I'll make this clear in the next patch version description. > > +/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */ > > +#define EVENT_BIT_FTRACE 0 > > +#define EVENT_BIT_PERF 1 > > +#define EVENT_BIT_OTHER 7 > > + > > +#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE) > > +#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF) > > +#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER) ... > > +#define DIAG_IOC_MAGIC '*' > > +#define DIAG_IOCSREG _IOW(DIAG_IOC_MAGIC, 0, char*) > > +#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*) > > +#define DIAG_IOCQLOCOFFSET _IO(DIAG_IOC_MAGIC, 2) > > These obviously will need to go into a user abi header file. > Yes, I'm glad you mentioned it. I wasn't entirely sure where it should live. Is there precedent on where to put these so they span both kernel and user for discovery / distribution? > > + > > +static char *register_page_data; > > + > > +static DEFINE_HASHTABLE(register_table, 4); > > +static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); > > + > > +struct user_event { > > + struct tracepoint tracepoint; > > + struct trace_event_call call; > > + struct trace_event_class class; > > + struct dyn_event devent; > > + struct hlist_node node; > > + atomic_t refs; > > + int index; > > + char *args; > > +}; > > + > > +#ifdef CONFIG_PERF_EVENTS > > +struct user_bpf_context { > > + int udatalen; > > + const char __user *udata; > > +}; > > +#endif > > + > > +typedef void (*user_event_func_t) (struct user_event *user, > > + const char __user *udata, > > + size_t udatalen, void *tpdata); > > + > > +static int register_user_event(char *name, char *args, > > + struct user_event **newuser); > > + > > [..] > Is the ask here to get user_bpf_context definition also into a user ABI header? (I took it as that). > > +static int __init trace_events_user_init(void) > > +{ > > + int ret; > > + > > + /* Zero all bits beside 0 (which is reserved for failures) */ > > + bitmap_zero(page_bitmap, MAX_EVENTS); > > + set_bit(0, page_bitmap); > > + > > + register_page_data = kmalloc(MAX_EVENTS, GFP_KERNEL); > > You want "kzalloc" here. Because when I read the map without adding > anything, I get: > > printf("%lx\n", *(unsigned long *)page_data); > > Produces: > > ffffffff9065004e > > But if I convert it to kzalloc() it gives me: > > 0 > > Thus, you are exposing stale memory. If you want to expose this to > non-admin users, this is a major security leak. > > -- Steve > Oops, sorry about that! Thanks, -Beau
On Wed, 6 Oct 2021 10:27:23 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Wed, Oct 06, 2021 at 12:54:41PM -0400, Steven Rostedt wrote: > > > Psuedo code example of typical usage: > > > page_fd = open("user_events_mmap", O_RDWR); > > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > > > > > data_fd = open("user_events_data", O_RDWR); > > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > > > > > if (page_data[data_id]) write(data_fd, &payload, sizeof(payload)); > > > > What is the type of "page_data". I'd like to test it before accepting it. > > > > From playing around, I see that page_data is of type char *. > Yes, it is char *. I'll make this clear in the next patch version > description. Thanks. > > > > +/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */ > > > +#define EVENT_BIT_FTRACE 0 > > > +#define EVENT_BIT_PERF 1 > > > +#define EVENT_BIT_OTHER 7 > > > + > > > +#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE) > > > +#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF) > > > +#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER) > ... > > > +#define DIAG_IOC_MAGIC '*' > > > +#define DIAG_IOCSREG _IOW(DIAG_IOC_MAGIC, 0, char*) > > > +#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*) > > > +#define DIAG_IOCQLOCOFFSET _IO(DIAG_IOC_MAGIC, 2) > > > > These obviously will need to go into a user abi header file. > > > Yes, I'm glad you mentioned it. I wasn't entirely sure where it should > live. Is there precedent on where to put these so they span both kernel > and user for discovery / distribution? There is a include/uapi directory in the Linux source code. I've never added to it (that I remember, but maybe I have? Wouldn't surprise me if I did and forgot about it :-p Sucks getting old). > > > > + > > > +static char *register_page_data; > > > + > > > +static DEFINE_HASHTABLE(register_table, 4); > > > +static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); > > > + > > > +struct user_event { > > > + struct tracepoint tracepoint; > > > + struct trace_event_call call; > > > + struct trace_event_class class; > > > + struct dyn_event devent; > > > + struct hlist_node node; > > > + atomic_t refs; > > > + int index; > > > + char *args; > > > +}; > > > + > > > +#ifdef CONFIG_PERF_EVENTS > > > +struct user_bpf_context { > > > + int udatalen; > > > + const char __user *udata; > > > +}; > > > +#endif > > > + > > > +typedef void (*user_event_func_t) (struct user_event *user, > > > + const char __user *udata, > > > + size_t udatalen, void *tpdata); > > > + > > > +static int register_user_event(char *name, char *args, > > > + struct user_event **newuser); > > > + > > > > [..] > > > Is the ask here to get user_bpf_context definition also into a user ABI > header? (I took it as that). Not sure I understand the question. > > > > +static int __init trace_events_user_init(void) > > > +{ > > > + int ret; > > > + > > > + /* Zero all bits beside 0 (which is reserved for failures) */ > > > + bitmap_zero(page_bitmap, MAX_EVENTS); > > > + set_bit(0, page_bitmap); > > > + > > > + register_page_data = kmalloc(MAX_EVENTS, GFP_KERNEL); > > > > You want "kzalloc" here. Because when I read the map without adding > > anything, I get: > > > > printf("%lx\n", *(unsigned long *)page_data); > > > > Produces: > > > > ffffffff9065004e > > > > But if I convert it to kzalloc() it gives me: > > > > 0 > > > > Thus, you are exposing stale memory. If you want to expose this to > > non-admin users, this is a major security leak. > > > > -- Steve > > > Oops, sorry about that! No problem. Just pointing it out. I expect that this is going to take a few back and forth to get right. But I do like the way it is heading. -- Steve
On Thu, Oct 07, 2021 at 01:28:27AM +0900, Masami Hiramatsu wrote: > > User processes often want to trace only when it's useful. To enable this > > a set of pages are mapped into the user process space that indicate the > > current state of the user events that have been registered. User > > processes can check if their event is hooked to a trace/probe, and if it > > is, emit the event data out via the write() syscall. > > > > Two files are added to tracefs to accomplish this: > > user_events_mmap - This file is mmap'd into participating user mode > > processes to indicate event status. > > It seems like the "method" is used for the file name of user_events_mmap, > instead of what is the purpose of the file. > What about "user_events_status" and show the informations (which program > made events and what tracer is using the events) when you read the file? > And when you mmap it, or when you ioctl to get mmapable fd, your program > can directly monitor the flags. > This sounds good to me. > > Psuedo code example of typical usage: > > page_fd = open("user_events_mmap", O_RDWR); > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > > > data_fd = open("user_events_data", O_RDWR); > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > Hmm, if the data_id is same as the ID in events/*/*/format, it will > be queried by libtraceevent or libftrace, isn't it? > And also, if you can define the user-event via dynamic_event interface, > it should be used instead of using ioctl on data channel. > These will not be the same ID. The data_id will be the offset within a page to check for status. We don't want to give user mode processes access to the entire tracefs, so we are isolating to a few files that will get wider access. > Oh, I thought that the each user events should not be accessible from other > processes. And I got the reason, if you run different processes (instances) > at the same time, those may need to share same events. > > In summary, I would like to suggest following interface to split > the control, monitor and data channels. > > - dynamic_events : Add new 'u' command but derived from synth event (or share the parser) > - user_event_status : monitor status by mmap, and showing human-readable status by read. > query offset by ioctl. Agree with the above, cool idea. > - user_event_data : just write data (binary packet). > We still require the ioctl to register and cannot bet solely on dynamic_events. We want a fast way to register events on process startup and get the data_id back as fast as possible. The other thing is we need ref counting to know if the event is busy. Having the ID in the packet avoids having a fd per-event, but it also makes ref counting process lifetime of each event quite hard. > User written binary packet will be based on the event format, for example, > > echo "u:foo u64 mydata; char[] msg" > > /sys/kernel/debug/tracing # cat events/user_events/foo/format > name: foo > ID: 1234 > format: > field:unsigned short common_type; offset:0; size:2; signed:0; > field:unsigned char common_flags; offset:2; size:1; signed:0; > field:unsigned char common_preempt_count; offset:3; size:1; signed:0; > field:int common_pid; offset:4; size:4; signed:1; > > field:u64 mydata; offset:8; size:8; signed:0; > field:__data_loc char[] msg; offset:16; size:4; signed:1; > > Then, the binary should be formatted as; > > [u32 ID][u64 mydata][u16 msg-len][u16 msg-offs][msg-data] > > Then user will generate a packet like; > > [1234][0xdeadbeef][12][16]["Hello world"] > > And write it to the user_event_data file. > > Kernel will verify len and offs if there is a dynamic array (string), > and record common fields, update msg-offs (because it will be shift the > size of common fields) and record user-data. > We want to avoid the kernel component touching any of the user data. For example eBPF programs will parse the data directly from definitions provided by the user. If the kernel side changed what the user expected from their side the eBPF program would fail to decode. We also want predicate filtering to work as cheap as possible. I would really like to keep offset manipulation entirely in the user space to avoid confusion across the various tracing mechanisms and avoid probing the user data upon each call (eBPF programs only selectively probe in data). > I also think we need a user-mode library for the above interface > so that user can easily bind to each application and language. I have a few colleagues testing some ideas out about this. Should I have them reach out to you for ideas or collaboration? Thanks, -Beau
Hi Beau, On Wed, 6 Oct 2021 10:56:11 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Thu, Oct 07, 2021 at 01:28:27AM +0900, Masami Hiramatsu wrote: > > > User processes often want to trace only when it's useful. To enable this > > > a set of pages are mapped into the user process space that indicate the > > > current state of the user events that have been registered. User > > > processes can check if their event is hooked to a trace/probe, and if it > > > is, emit the event data out via the write() syscall. > > > > > > Two files are added to tracefs to accomplish this: > > > user_events_mmap - This file is mmap'd into participating user mode > > > processes to indicate event status. > > > > It seems like the "method" is used for the file name of user_events_mmap, > > instead of what is the purpose of the file. > > What about "user_events_status" and show the informations (which program > > made events and what tracer is using the events) when you read the file? > > And when you mmap it, or when you ioctl to get mmapable fd, your program > > can directly monitor the flags. > > > This sounds good to me. > > > > Psuedo code example of typical usage: > > > page_fd = open("user_events_mmap", O_RDWR); > > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > > > > > data_fd = open("user_events_data", O_RDWR); > > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > > > Hmm, if the data_id is same as the ID in events/*/*/format, it will > > be queried by libtraceevent or libftrace, isn't it? > > And also, if you can define the user-event via dynamic_event interface, > > it should be used instead of using ioctl on data channel. > > > These will not be the same ID. The data_id will be the offset within a > page to check for status. We don't want to give user mode processes > access to the entire tracefs, so we are isolating to a few files that > will get wider access. OK, but how to query the id of the event which has been made from writing dynamic_events file? (maybe query from name by ioctl?) > > Oh, I thought that the each user events should not be accessible from other > > processes. And I got the reason, if you run different processes (instances) > > at the same time, those may need to share same events. > > > > In summary, I would like to suggest following interface to split > > the control, monitor and data channels. > > > > - dynamic_events : Add new 'u' command but derived from synth event (or share the parser) > > - user_event_status : monitor status by mmap, and showing human-readable status by read. > > query offset by ioctl. > Agree with the above, cool idea. > > > - user_event_data : just write data (binary packet). > > > We still require the ioctl to register and cannot bet solely on > dynamic_events. We want a fast way to register events on process startup > and get the data_id back as fast as possible. What about using ioctl on 'user_event_status' file? But in this case, the name is not showing the function. Then, what about renaming it as simply 'user_events'? :) This file will show the current user events as similar to synth_events, in addition, it will show the current status in "# comment". $ echo "user1 u32 arg1; u64 arg2" > user_events $ cat user_events user1 u32 arg1; u64 arg2 $ echo 1 > events/user_events/user1/enable $ cat user_events user1 u32 arg1; u64 arg2 # Used by ftrace In addition, - user-process can do ioctl() to query offset instead of id from name. - user-process can do mmap() the file to monitor the status. > The other thing is we need ref counting to know if the event is busy. > Having the ID in the packet avoids having a fd per-event, but it also > makes ref counting process lifetime of each event quite hard. Hmm, I don't think so. You can use an array of the pointer to events on the private data of the struct file. When you add (or start using) an event (this is identified by ioctl), you can increment the event refcount and add it to the array. When the file is closed (in exiting process), it will loop on the array and decrement the refcount for each event. Then, after all tracers disabled the event, ftrace can remove the event in background (unless it is defined through 'dynamic_events' or 'user_events'). > > User written binary packet will be based on the event format, for example, > > > > echo "u:foo u64 mydata; char[] msg" > > > > /sys/kernel/debug/tracing # cat events/user_events/foo/format > > name: foo > > ID: 1234 > > format: > > field:unsigned short common_type; offset:0; size:2; signed:0; > > field:unsigned char common_flags; offset:2; size:1; signed:0; > > field:unsigned char common_preempt_count; offset:3; size:1; signed:0; > > field:int common_pid; offset:4; size:4; signed:1; > > > > field:u64 mydata; offset:8; size:8; signed:0; > > field:__data_loc char[] msg; offset:16; size:4; signed:1; > > > > Then, the binary should be formatted as; > > > > [u32 ID][u64 mydata][u16 msg-len][u16 msg-offs][msg-data] > > > > Then user will generate a packet like; > > > > [1234][0xdeadbeef][12][16]["Hello world"] > > > > And write it to the user_event_data file. > > > > Kernel will verify len and offs if there is a dynamic array (string), > > and record common fields, update msg-offs (because it will be shift the > > size of common fields) and record user-data. > > > We want to avoid the kernel component touching any of the user data. For > example eBPF programs will parse the data directly from definitions > provided by the user. If the kernel side changed what the user expected > from their side the eBPF program would fail to decode. Sorry, that's not doable unless we introduce a new __rel_loc__ attribute to the ftrace (and libtraceevent). With that, the dynamic data (like string) offset can be written as the offset from itself. > > [1234][0xdeadbeef][12][0]["Hello world"] Anyway, as far as you are using trace event the data format itself is the contract between kernel and user, because *kernel* also has to decode it (e.g. user will read the 'trace' file.) and other tools like perf has to decode it too. The kernel will not modify the data (again with __rel_loc__ attr) but it will add the common_* field and event ID as a header to its buffer when record it. > We also want > predicate filtering to work as cheap as possible. I would really like to > keep offset manipulation entirely in the user space to avoid confusion > across the various tracing mechanisms and avoid probing the user data > upon each call (eBPF programs only selectively probe in data). OK, so let's add __rel_loc__ attribute. The rel_loc type will be struct rel_loc { uint16_t len; /* The data size (including '\0' if string )*/ uint16_t offs; /* The offset of actual data from this field */ } __packed; Hmm, btw, this will be good for probe events... I don't need to pass the base address with this attribute. > > > I also think we need a user-mode library for the above interface > > so that user can easily bind to each application and language. > I have a few colleagues testing some ideas out about this. Should I have > them reach out to you for ideas or collaboration? Yeah, we always collaborate on LKML and other MLs :) Thank you, > > Thanks, > -Beau
On Thu, Oct 07, 2021 at 11:17:38PM +0900, Masami Hiramatsu wrote: > > > > Psuedo code example of typical usage: > > > > page_fd = open("user_events_mmap", O_RDWR); > > > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > > > > > > > data_fd = open("user_events_data", O_RDWR); > > > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > > > > > Hmm, if the data_id is same as the ID in events/*/*/format, it will > > > be queried by libtraceevent or libftrace, isn't it? > > > And also, if you can define the user-event via dynamic_event interface, > > > it should be used instead of using ioctl on data channel. > > > > > These will not be the same ID. The data_id will be the offset within a > > page to check for status. We don't want to give user mode processes > > access to the entire tracefs, so we are isolating to a few files that > > will get wider access. > > OK, but how to query the id of the event which has been made > from writing dynamic_events file? (maybe query from name by > ioctl?) > Yes, you can either do ioctl or with the v2 patch you can now just cat user_events_status as you indicated previously to do :) > > We still require the ioctl to register and cannot bet solely on > > dynamic_events. We want a fast way to register events on process startup > > and get the data_id back as fast as possible. > > What about using ioctl on 'user_event_status' file? > But in this case, the name is not showing the function. > > Then, what about renaming it as simply 'user_events'? :) > This file will show the current user events as similar to synth_events, > in addition, it will show the current status in "# comment". > > $ echo "user1 u32 arg1; u64 arg2" > user_events > $ cat user_events > user1 u32 arg1; u64 arg2 > $ echo 1 > events/user_events/user1/enable > $ cat user_events > user1 u32 arg1; u64 arg2 # Used by ftrace > > In addition, > - user-process can do ioctl() to query offset instead of id from name. > - user-process can do mmap() the file to monitor the status. > Please see v2 patch, I do this pattern except it's '(Used by ftrace)' instead of '# Used by ftrace'. Format is id:name status > > The other thing is we need ref counting to know if the event is busy. > > Having the ID in the packet avoids having a fd per-event, but it also > > makes ref counting process lifetime of each event quite hard. > > Hmm, I don't think so. You can use an array of the pointer to > events on the private data of the struct file. > When you add (or start using) an event (this is identified by ioctl), > you can increment the event refcount and add it to the array. > When the file is closed (in exiting process), it will loop on the > array and decrement the refcount for each event. > Then, after all tracers disabled the event, ftrace can remove the > event in background (unless it is defined through 'dynamic_events' or > 'user_events'). > Yes, I didn't say it's impossible :) It's quite hard and takes a lot more management. I don't see a clear benefit to that approach, why is it better than an fd lifetime? Not trying to be difficult, just trying to be pragmatic about what approach is best. > > We want to avoid the kernel component touching any of the user data. For > > example eBPF programs will parse the data directly from definitions > > provided by the user. If the kernel side changed what the user expected > > from their side the eBPF program would fail to decode. > > Sorry, that's not doable unless we introduce a new __rel_loc__ attribute > to the ftrace (and libtraceevent). With that, the dynamic data > (like string) offset can be written as the offset from itself. > With this patch I have no problem getting hist, trigger and filter to work since the user is providing all correct offsets. Can you help me understand why this won't work? (It appears to work well without the kernel modifying user provided memory). > > > [1234][0xdeadbeef][12][0]["Hello world"] > > Anyway, as far as you are using trace event the data format itself is > the contract between kernel and user, because *kernel* also has to > decode it (e.g. user will read the 'trace' file.) and other tools > like perf has to decode it too. > > The kernel will not modify the data (again with __rel_loc__ attr) but > it will add the common_* field and event ID as a header to its buffer > when record it. > Adding the common fields is fine, and works well with the current patch. I've been able to decode in ftrace, perf and eBPF without any issues for common types. __data_loc needs user mode cooperation, but no other type does. I've also tested hist, filter and trigger and all appear to work as I understand them to. > > We also want > > predicate filtering to work as cheap as possible. I would really like to > > keep offset manipulation entirely in the user space to avoid confusion > > across the various tracing mechanisms and avoid probing the user data > > upon each call (eBPF programs only selectively probe in data). > > OK, so let's add __rel_loc__ attribute. The rel_loc type will be > > struct rel_loc { > uint16_t len; /* The data size (including '\0' if string )*/ > uint16_t offs; /* The offset of actual data from this field */ > } __packed; > > Hmm, btw, this will be good for probe events... I don't need to pass > the base address with this attribute. > What's the difference between __rel_loc__ and __data_loc? Seems like instead of just offset it's length + offset? Thanks, -Beau
On Thu, 7 Oct 2021 09:22:04 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Thu, Oct 07, 2021 at 11:17:38PM +0900, Masami Hiramatsu wrote: > > > > > Psuedo code example of typical usage: > > > > > page_fd = open("user_events_mmap", O_RDWR); > > > > > page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); > > > > > > > > > > data_fd = open("user_events_data", O_RDWR); > > > > > data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); > > > > > > > > Hmm, if the data_id is same as the ID in events/*/*/format, it will > > > > be queried by libtraceevent or libftrace, isn't it? > > > > And also, if you can define the user-event via dynamic_event interface, > > > > it should be used instead of using ioctl on data channel. > > > > > > > These will not be the same ID. The data_id will be the offset within a > > > page to check for status. We don't want to give user mode processes > > > access to the entire tracefs, so we are isolating to a few files that > > > will get wider access. > > > > OK, but how to query the id of the event which has been made > > from writing dynamic_events file? (maybe query from name by > > ioctl?) > > > Yes, you can either do ioctl or with the v2 patch you can now just cat > user_events_status as you indicated previously to do :) OK, let me check the v2. > > > > We still require the ioctl to register and cannot bet solely on > > > dynamic_events. We want a fast way to register events on process startup > > > and get the data_id back as fast as possible. > > > > What about using ioctl on 'user_event_status' file? > > But in this case, the name is not showing the function. > > > > Then, what about renaming it as simply 'user_events'? :) > > This file will show the current user events as similar to synth_events, > > in addition, it will show the current status in "# comment". > > > > $ echo "user1 u32 arg1; u64 arg2" > user_events > > $ cat user_events > > user1 u32 arg1; u64 arg2 > > $ echo 1 > events/user_events/user1/enable > > $ cat user_events > > user1 u32 arg1; u64 arg2 # Used by ftrace > > > > In addition, > > - user-process can do ioctl() to query offset instead of id from name. > > - user-process can do mmap() the file to monitor the status. > > > Please see v2 patch, I do this pattern except it's '(Used by ftrace)' > instead of '# Used by ftrace'. > > Format is id:name status Hm, why I suggested to use "# status" is that the comment will be removed when writing it. So you can do cat user_events > ~/saved_events (reboot) cat ~/saved_events > user_events to restore events :) > > > > The other thing is we need ref counting to know if the event is busy. > > > Having the ID in the packet avoids having a fd per-event, but it also > > > makes ref counting process lifetime of each event quite hard. > > > > Hmm, I don't think so. You can use an array of the pointer to > > events on the private data of the struct file. > > When you add (or start using) an event (this is identified by ioctl), > > you can increment the event refcount and add it to the array. > > When the file is closed (in exiting process), it will loop on the > > array and decrement the refcount for each event. > > Then, after all tracers disabled the event, ftrace can remove the > > event in background (unless it is defined through 'dynamic_events' or > > 'user_events'). > > > Yes, I didn't say it's impossible :) It's quite hard and takes a lot > more management. I don't see a clear benefit to that approach, why is it > better than an fd lifetime? Not trying to be difficult, just trying to > be pragmatic about what approach is best. I'm not sure this point, you mean 1 fd == 1 event model? > > > We want to avoid the kernel component touching any of the user data. For > > > example eBPF programs will parse the data directly from definitions > > > provided by the user. If the kernel side changed what the user expected > > > from their side the eBPF program would fail to decode. > > > > Sorry, that's not doable unless we introduce a new __rel_loc__ attribute > > to the ftrace (and libtraceevent). With that, the dynamic data > > (like string) offset can be written as the offset from itself. > > > With this patch I have no problem getting hist, trigger and filter to > work since the user is providing all correct offsets. Can you help me > understand why this won't work? (It appears to work well without the > kernel modifying user provided memory). OK, let me check it. > > > > > [1234][0xdeadbeef][12][0]["Hello world"] > > > > Anyway, as far as you are using trace event the data format itself is > > the contract between kernel and user, because *kernel* also has to > > decode it (e.g. user will read the 'trace' file.) and other tools > > like perf has to decode it too. > > > > The kernel will not modify the data (again with __rel_loc__ attr) but > > it will add the common_* field and event ID as a header to its buffer > > when record it. > > > Adding the common fields is fine, and works well with the current patch. > I've been able to decode in ftrace, perf and eBPF without any issues for > common types. __data_loc needs user mode cooperation, but no other type > does. I've also tested hist, filter and trigger and all appear to work > as I understand them to. __data_loc has the data offset field, which is the offset from the entry of the recorded event on the ring buffer. Since there are common fields, the offset is not the offset from user given buffer. So it must be changed by kernel. (Note that common fields expected to exist in all events, so that user scripts can process the fields commonly) > > > > We also want > > > predicate filtering to work as cheap as possible. I would really like to > > > keep offset manipulation entirely in the user space to avoid confusion > > > across the various tracing mechanisms and avoid probing the user data > > > upon each call (eBPF programs only selectively probe in data). > > > > OK, so let's add __rel_loc__ attribute. The rel_loc type will be > > > > struct rel_loc { > > uint16_t len; /* The data size (including '\0' if string )*/ > > uint16_t offs; /* The offset of actual data from this field */ > > } __packed; > > > > Hmm, btw, this will be good for probe events... I don't need to pass > > the base address with this attribute. > > > What's the difference between __rel_loc__ and __data_loc? Seems like > instead of just offset it's length + offset? In my idea, rel_loc is similar to the data_loc. It has the offset, but the offset is the data offset from the rel_loc, not from the entry of the recorded data. So kernel doesn't need to adjust it. Thank you,
On Fri, Oct 08, 2021 at 08:12:49AM +0900, Masami Hiramatsu wrote: > > Please see v2 patch, I do this pattern except it's '(Used by ftrace)' > > instead of '# Used by ftrace'. > > > > Format is id:name status > > Hm, why I suggested to use "# status" is that the comment will be > removed when writing it. So you can do > > cat user_events > ~/saved_events > (reboot) > cat ~/saved_events > user_events > > to restore events :) > Nice, good idea. > > > > > > The other thing is we need ref counting to know if the event is busy. > > > > Having the ID in the packet avoids having a fd per-event, but it also > > > > makes ref counting process lifetime of each event quite hard. > > > > > > Hmm, I don't think so. You can use an array of the pointer to > > > events on the private data of the struct file. > > > When you add (or start using) an event (this is identified by ioctl), > > > you can increment the event refcount and add it to the array. > > > When the file is closed (in exiting process), it will loop on the > > > array and decrement the refcount for each event. > > > Then, after all tracers disabled the event, ftrace can remove the > > > event in background (unless it is defined through 'dynamic_events' or > > > 'user_events'). > > > > > Yes, I didn't say it's impossible :) It's quite hard and takes a lot > > more management. I don't see a clear benefit to that approach, why is it > > better than an fd lifetime? Not trying to be difficult, just trying to > > be pragmatic about what approach is best. > > I'm not sure this point, you mean 1 fd == 1 event model? > Yeah, I like the idea of not having an fd per event. I want to make sure the complexity is worth it. Is the overhead of an FD per event in user space too much? What happens to the first 4 bytes (ID)? Does it not show up in the buffer? This would be fine as long as the rel_loc idea gets into ftrace, etc. This would require a global array as well as a local per-FD array. I'm wondering if the per-FD array becoming large mitigates the gain by simply having an FD per-event. > > > > We also want > > > > predicate filtering to work as cheap as possible. I would really like to > > > > keep offset manipulation entirely in the user space to avoid confusion > > > > across the various tracing mechanisms and avoid probing the user data > > > > upon each call (eBPF programs only selectively probe in data). > > > > > > OK, so let's add __rel_loc__ attribute. The rel_loc type will be > > > > > > struct rel_loc { > > > uint16_t len; /* The data size (including '\0' if string )*/ > > > uint16_t offs; /* The offset of actual data from this field */ > > > } __packed; > > > > > > Hmm, btw, this will be good for probe events... I don't need to pass > > > the base address with this attribute. > > > > > What's the difference between __rel_loc__ and __data_loc? Seems like > > instead of just offset it's length + offset? > > In my idea, rel_loc is similar to the data_loc. It has the offset, but > the offset is the data offset from the rel_loc, not from the entry of > the recorded data. So kernel doesn't need to adjust it. > Got it, makes sense and would eliminate the need for the IOCTL for offsets. I like it. Thanks, -Beau > Thank you, > > -- > Masami Hiramatsu <mhiramat@kernel.org>
On Thu, 7 Oct 2021 17:05:40 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > > The other thing is we need ref counting to know if the event is busy. > > > > > Having the ID in the packet avoids having a fd per-event, but it also > > > > > makes ref counting process lifetime of each event quite hard. > > > > > > > > Hmm, I don't think so. You can use an array of the pointer to > > > > events on the private data of the struct file. > > > > When you add (or start using) an event (this is identified by ioctl), > > > > you can increment the event refcount and add it to the array. > > > > When the file is closed (in exiting process), it will loop on the > > > > array and decrement the refcount for each event. > > > > Then, after all tracers disabled the event, ftrace can remove the > > > > event in background (unless it is defined through 'dynamic_events' or > > > > 'user_events'). > > > > > > > Yes, I didn't say it's impossible :) It's quite hard and takes a lot > > > more management. I don't see a clear benefit to that approach, why is it > > > better than an fd lifetime? Not trying to be difficult, just trying to > > > be pragmatic about what approach is best. > > > > I'm not sure this point, you mean 1 fd == 1 event model? > > > Yeah, I like the idea of not having an fd per event. Ah, OK. I misunderstood the idea. per-FD model sounds like having events/user-events/*/marker file. > I want to make > sure the complexity is worth it. Is the overhead of an FD per event in > user space too much? It depends on the use case, how much events you wants to use with the user-events. If there are hundreds of the evets, that will consume kernel resources and /proc/*/fd/ will be filled with the event's fds. But if there is a few events, I think no problem. > What happens to the first 4 bytes (ID)? Does it not > show up in the buffer? You can add the 'ID' field commonly in the user-event by default if you need it. Or, just skip the ID as it is a header of the packet. (since the ID is process local number, that will not important for the tracers who trace the events by name) > This would be fine as long as the rel_loc idea > gets into ftrace, etc. > > This would require a global array as well as a local per-FD array. I'm > wondering if the per-FD array becoming large mitigates the gain by > simply having an FD per-event. OK, I got it. I hope no one adds hundreds of events at once for trace. Thank you,
On 10/6/21 12:44 AM, Beau Belgrave wrote: > User mode processes that wish to use trace events to get data into > ftrace, perf, eBPF, etc are limited to uprobes today. The user events > features enables an ABI for user mode processes to create and write to > trace events that are isolated from kernel level trace events. This > enables a faster path for tracing from user mode data as well as opens > managed code to participate in trace events, where stub locations are > dynamic. Is this not very much what the trace_marker do?
On Fri, Oct 08, 2021 at 01:11:59PM +0000, Peter.Enderborg@sony.com wrote: > On 10/6/21 12:44 AM, Beau Belgrave wrote: > > User mode processes that wish to use trace events to get data into > > ftrace, perf, eBPF, etc are limited to uprobes today. The user events > > features enables an ABI for user mode processes to create and write to > > trace events that are isolated from kernel level trace events. This > > enables a faster path for tracing from user mode data as well as opens > > managed code to participate in trace events, where stub locations are > > dynamic. > > Is this not very much what the trace_marker do? > At a very high level, yes, both get user data into ftrace. This question has been brought up a few times, if you watch the LPC2021 Tracing MC session this came up and got answered. Markers do not get user data into perf and eBPF, nor do they allow user mode applications to know when to emit the trace_marker (we only want to trace and incur the syscall cost when something requests that data). We also want to be able to use all the bells and whistles of ftrace/perf. This means supporting field labels so things like hist, filter and triggers work on a per-event basis (IE: Durable identifier such as the event name). Thanks, -Beau
On Fri, Oct 08, 2021 at 06:22:58PM +0900, Masami Hiramatsu wrote: > > > I'm not sure this point, you mean 1 fd == 1 event model? > > > > > Yeah, I like the idea of not having an fd per event. > > Ah, OK. I misunderstood the idea. > per-FD model sounds like having events/user-events/*/marker file. > Thanks for the back and forth, I appreciate your time on this. Yes, in my mind there are two options to avoid kernel memory usage per-event. 1. We have a an array per file struct that is independently ref-counted. This is required to ensure lifetime requirements and to ensure user code cannot access other user events that might have been free'd outside of the lifetime and cause a kernel crash. This approach also requires 2 int's to be returned, 1 for the status page the other a local index for the write into the above array per-file struct. This is likely the most complex method due to it's lifetime and RCU synchronization requirements. However, it represents the least memory to both kernel and user space. 2. We have a anon_inode FD that gets installed into the user process and returned via the ioctl from user_events tracefs file. The file struct backing the FD is shared by all user mode processes for that event. Like having an inject/marker file per-event in the user_events subsystem. This approach requires an FD returned and either an int for the status page or the returend FD could expose the ID via another IOCTL being issued. This is the simplest method since the FD manages the lifetime, when FD is released so is the shared file struct. Kernel side memory is reduced to only unique events that are actively being used. There is no RCU or synchronization beyond the FD lifetime. The user mode processes does incur an FD per-event within their file description table. So they events charge against their FD per-process limit (not necessarily a bad thing). This also seems to follow the pre-existing patterns of tracefs (trace_marker, inject, format, etc all have a shared file available to user-processes that have been granted access). For our case, we want that, but we want it on a access boundary to who all have access to the user_events_* tracefs files. We don't want to open up all of tracefs widely. > > I want to make > > sure the complexity is worth it. Is the overhead of an FD per event in > > user space too much? > > It depends on the use case, how much events you wants to use with > the user-events. If there are hundreds of the evets, that will consume > kernel resources and /proc/*/fd/ will be filled with the event's fds. > But if there is a few events, I think no problem. > In our own use case this will be low due to the way we plan to use the events. However, I am not sure others will follow that :) Thanks, -Beau
On Mon, 11 Oct 2021 09:25:23 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > > Yes, in my mind there are two options to avoid kernel memory usage > per-event. > > 1. > We have a an array per file struct that is independently ref-counted. > This is required to ensure lifetime requirements and to ensure user code > cannot access other user events that might have been free'd outside of > the lifetime and cause a kernel crash. > > This approach also requires 2 int's to be returned, 1 for the status > page the other a local index for the write into the above array per-file > struct. > > This is likely the most complex method due to it's lifetime and RCU > synchronization requirements. However, it represents the least memory to > both kernel and user space. Does it require RCU synchronization as the updates only happen from user space. But is this for the writing of the event? You want a separate fd for each event to write to, instead of saying you have another interface to write and just pass the given id? > > 2. > We have a anon_inode FD that gets installed into the user process and > returned via the ioctl from user_events tracefs file. The file struct > backing the FD is shared by all user mode processes for that event. Like > having an inject/marker file per-event in the user_events subsystem. > > This approach requires an FD returned and either an int for the status > page or the returend FD could expose the ID via another IOCTL being > issued. > > This is the simplest method since the FD manages the lifetime, when FD > is released so is the shared file struct. Kernel side memory is reduced > to only unique events that are actively being used. There is no RCU or > synchronization beyond the FD lifetime. The user mode processes does > incur an FD per-event within their file description table. So they > events charge against their FD per-process limit (not necessarily a bad > thing). > > This also seems to follow the pre-existing patterns of tracefs > (trace_marker, inject, format, etc all have a shared file available to > user-processes that have been granted access). For our case, we want > that, but we want it on a access boundary to who all have access to the > user_events_* tracefs files. We don't want to open up all of tracefs > widely. > > > > I want to make > > > sure the complexity is worth it. Is the overhead of an FD per event in > > > user space too much? > > > > It depends on the use case, how much events you wants to use with > > the user-events. If there are hundreds of the evets, that will consume > > kernel resources and /proc/*/fd/ will be filled with the event's fds. > > But if there is a few events, I think no problem. > > > In our own use case this will be low due to the way we plan to use the > events. However, I am not sure others will follow that :) I will say, whenever we say this will only have a "few", if it becomes useful, it will end up having many. -- Steve
On Mon, 11 Oct 2021 09:25:23 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Fri, Oct 08, 2021 at 06:22:58PM +0900, Masami Hiramatsu wrote: > > > > I'm not sure this point, you mean 1 fd == 1 event model? > > > > > > > Yeah, I like the idea of not having an fd per event. > > > > Ah, OK. I misunderstood the idea. > > per-FD model sounds like having events/user-events/*/marker file. > > > Thanks for the back and forth, I appreciate your time on this. > > Yes, in my mind there are two options to avoid kernel memory usage > per-event. > > 1. > We have a an array per file struct that is independently ref-counted. > This is required to ensure lifetime requirements and to ensure user code > cannot access other user events that might have been free'd outside of > the lifetime and cause a kernel crash. > > This approach also requires 2 int's to be returned, 1 for the status > page the other a local index for the write into the above array per-file > struct. > > This is likely the most complex method due to it's lifetime and RCU > synchronization requirements. However, it represents the least memory to > both kernel and user space. > > 2. > We have a anon_inode FD that gets installed into the user process and > returned via the ioctl from user_events tracefs file. The file struct > backing the FD is shared by all user mode processes for that event. Like > having an inject/marker file per-event in the user_events subsystem. Is it safe to share the same file structure among all processes? (sharing FD via ipc may do same thing?) > This approach requires an FD returned and either an int for the status > page or the returend FD could expose the ID via another IOCTL being > issued. OK, I would like to suggest you to add events/user-events/*/marker file (which returns that shared file struct backed FD) so that some simple user scripts can also send the events (these may not use ioctl, just write the events.) But this can be done afterwards anyway. > This is the simplest method since the FD manages the lifetime, when FD > is released so is the shared file struct. Kernel side memory is reduced > to only unique events that are actively being used. There is no RCU or > synchronization beyond the FD lifetime. The user mode processes does > incur an FD per-event within their file description table. So they > events charge against their FD per-process limit (not necessarily a bad > thing). Yeah, usually FD ulimit will be much bigger than the number of events. > > This also seems to follow the pre-existing patterns of tracefs > (trace_marker, inject, format, etc all have a shared file available to > user-processes that have been granted access). For our case, we want > that, but we want it on a access boundary to who all have access to the > user_events_* tracefs files. We don't want to open up all of tracefs > widely. I think it could be a user choice, and it is possible to add special access rights for user-events. Anyway, this is an advanced item. > > > > I want to make > > > sure the complexity is worth it. Is the overhead of an FD per event in > > > user space too much? > > > > It depends on the use case, how much events you wants to use with > > the user-events. If there are hundreds of the evets, that will consume > > kernel resources and /proc/*/fd/ will be filled with the event's fds. > > But if there is a few events, I think no problem. > > > In our own use case this will be low due to the way we plan to use the > events. However, I am not sure others will follow that :) I just concerned if qemu consider to use this interface for their event log :) Thank you,
On Thu, 14 Oct 2021 00:21:32 +0900 Masami Hiramatsu <mhiramat@kernel.org> wrote: > > This approach requires an FD returned and either an int for the status > > page or the returend FD could expose the ID via another IOCTL being > > issued. > > OK, I would like to suggest you to add events/user-events/*/marker file > (which returns that shared file struct backed FD) so that some simple > user scripts can also send the events (these may not use ioctl, just > write the events.) But this can be done afterwards anyway. > I'd prefer we avoid this. It will break some of the semantics of the events directory. One, only "user-events" will have this "marker" file. Although it will be very similar to the "inject" file, but that's only for debugging anyway. All the files in the events directory is starting to add a bunch of overhead, as they are typically copied into the instances. Although, when we get the eventfs directory created, maybe that will not be as big of a deal. But that still doesn't fix the semantics issue. -- Steve
On Tue, Oct 12, 2021 at 09:18:52PM -0400, Steven Rostedt wrote: > On Mon, 11 Oct 2021 09:25:23 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > Yes, in my mind there are two options to avoid kernel memory usage > > per-event. > > > > 1. > > We have a an array per file struct that is independently ref-counted. > > This is required to ensure lifetime requirements and to ensure user code > > cannot access other user events that might have been free'd outside of > > the lifetime and cause a kernel crash. > > > > This approach also requires 2 int's to be returned, 1 for the status > > page the other a local index for the write into the above array per-file > > struct. > > > > This is likely the most complex method due to it's lifetime and RCU > > synchronization requirements. However, it represents the least memory to > > both kernel and user space. > > Does it require RCU synchronization as the updates only happen from > user space. But is this for the writing of the event? You want a > separate fd for each event to write to, instead of saying you have > another interface to write and just pass the given id? > Yes, an example is a process creates the fd and registers some events. Then the process forks and the child registers another event using the same fd that was inherited. If the original process writes while the child process registers at that point the FD array can get resized / moved, therefore we need RCU deref protection when resizing, etc. I have a few gauntlet tools that try to crash user_events by writing, registering, unregistering at weird times to try to flush this stuff out. > > In our own use case this will be low due to the way we plan to use the > > events. However, I am not sure others will follow that :) > > I will say, whenever we say this will only have a "few", if it becomes > useful, it will end up having many. > > -- Steve Agree 100%, I've gone back and forth on which is better for a while. I'm happy to update to RCU and send out a V3. Want to make sure we have consensus of the right approach before spinning on it :) Thanks, -Beau
On Thu, Oct 14, 2021 at 12:21:32AM +0900, Masami Hiramatsu wrote: > On Mon, 11 Oct 2021 09:25:23 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > On Fri, Oct 08, 2021 at 06:22:58PM +0900, Masami Hiramatsu wrote: > > > > > I'm not sure this point, you mean 1 fd == 1 event model? > > > > > > > > > Yeah, I like the idea of not having an fd per event. > > > > > > Ah, OK. I misunderstood the idea. > > > per-FD model sounds like having events/user-events/*/marker file. > > > > > 2. > > We have a anon_inode FD that gets installed into the user process and > > returned via the ioctl from user_events tracefs file. The file struct > > backing the FD is shared by all user mode processes for that event. Like > > having an inject/marker file per-event in the user_events subsystem. > > Is it safe to share the same file structure among all processes? > (sharing FD via ipc may do same thing?) > I believe so, perf_event_open syscall uses this approach. I think sharing among processes would only be a problem if the file_operations methods assumed some synchronization. I don't see how this would be different than a fork inheriting a pre-existing FD. > > > > I want to make > > > > sure the complexity is worth it. Is the overhead of an FD per event in > > > > user space too much? > > > > > > It depends on the use case, how much events you wants to use with > > > the user-events. If there are hundreds of the evets, that will consume > > > kernel resources and /proc/*/fd/ will be filled with the event's fds. > > > But if there is a few events, I think no problem. > > > > > In our own use case this will be low due to the way we plan to use the > > events. However, I am not sure others will follow that :) > > I just concerned if qemu consider to use this interface for their event > log :) > Yep, agree. It sounds like taking an index'd approach as you first suggested is worth the complexity. I want to make sure you and Steven agree before attempting. Thanks, -Beau
On Wed, 13 Oct 2021 09:50:43 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > > Does it require RCU synchronization as the updates only happen from > > user space. But is this for the writing of the event? You want a > > separate fd for each event to write to, instead of saying you have > > another interface to write and just pass the given id? > > > Yes, an example is a process creates the fd and registers some events. > Then the process forks and the child registers another event using the > same fd that was inherited. Well, I was thinking simple locking could work too. But I guess RCU is like Batman. You know, "Always be yourself. Unless you can be Batman, then always be Batman!". So always use locking, unless you can use RCU, then always use RCU. > > If the original process writes while the child process registers at that > point the FD array can get resized / moved, therefore we need RCU deref > protection when resizing, etc. > > I have a few gauntlet tools that try to crash user_events by writing, > registering, unregistering at weird times to try to flush this stuff > out. > > > In our own use case this will be low due to the way we plan to use the > > > events. However, I am not sure others will follow that :) > > > > I will say, whenever we say this will only have a "few", if it becomes > > useful, it will end up having many. > > > > -- Steve > Agree 100%, I've gone back and forth on which is better for a while. I'm > happy to update to RCU and send out a V3. Want to make sure we have > consensus of the right approach before spinning on it :) > Sure, thanks. -- Steve
On Wed, Oct 13, 2021 at 01:11:55PM -0400, Steven Rostedt wrote: > On Wed, 13 Oct 2021 09:50:43 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > Does it require RCU synchronization as the updates only happen from > > > user space. But is this for the writing of the event? You want a > > > separate fd for each event to write to, instead of saying you have > > > another interface to write and just pass the given id? > > > > > Yes, an example is a process creates the fd and registers some events. > > Then the process forks and the child registers another event using the > > same fd that was inherited. > > Well, I was thinking simple locking could work too. But I guess RCU is like > Batman. You know, "Always be yourself. Unless you can be Batman, then > always be Batman!". So always use locking, unless you can use RCU, > then always use RCU. > LOL, I'm happy to use a rwlock_t instead. Not sure which is faster, to me I care most about the write path not skewing clock times of the events being emitted. It seems like the contention case will be low in most cases, so these paths will be read-only most of the time. It seems rwlock_t has the disadvantage of the writes blocking on the realloc/free case during the resize. RCU can delay the free until something has time to do so, so seems a good fit. Thoughts? Thanks, -Beau
On Wed, 13 Oct 2021 10:17:47 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Wed, Oct 13, 2021 at 01:11:55PM -0400, Steven Rostedt wrote: > > On Wed, 13 Oct 2021 09:50:43 -0700 > > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > > > Does it require RCU synchronization as the updates only happen from > > > > user space. But is this for the writing of the event? You want a > > > > separate fd for each event to write to, instead of saying you have > > > > another interface to write and just pass the given id? > > > > > > > Yes, an example is a process creates the fd and registers some events. > > > Then the process forks and the child registers another event using the > > > same fd that was inherited. > > > > Well, I was thinking simple locking could work too. But I guess RCU is like > > Batman. You know, "Always be yourself. Unless you can be Batman, then > > always be Batman!". So always use locking, unless you can use RCU, > > then always use RCU. > > > LOL, I'm happy to use a rwlock_t instead. Not sure which is faster, to > me I care most about the write path not skewing clock times of the > events being emitted. It seems like the contention case will be low in > most cases, so these paths will be read-only most of the time. > > It seems rwlock_t has the disadvantage of the writes blocking on the > realloc/free case during the resize. RCU can delay the free until > something has time to do so, so seems a good fit. > > Thoughts? You can always do the allocation and free outside the rwlock_t. new_data = alloc(); lock(); update new_data with old_data ptr = new_data; unlock(); free old_data And is the preferred method, as we don't want allocation or freeing done inside the locking (especially on RT, where rwlocks are not "special"). The main concern is cache contention with the updates, even among readers. That is, readers may not block on each other, but the accessing of the same lock will cause cache contention. And writers will block. I don't remember if rwlocks are fair or not (when a writer blocks, all new readers block too.) I think it is. For RCU, it's how you free it. You can push it off to a queue, if you have a field in the data structure that can be added to the rcu link list that wont affect the readers. If you are concerned about the contention between readers, then RCU is the way to go, as it doesn't have that issue. -- Steve
On Wed, 13 Oct 2021 11:40:34 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 14 Oct 2021 00:21:32 +0900 > Masami Hiramatsu <mhiramat@kernel.org> wrote: > > > > This approach requires an FD returned and either an int for the status > > > page or the returend FD could expose the ID via another IOCTL being > > > issued. > > > > OK, I would like to suggest you to add events/user-events/*/marker file > > (which returns that shared file struct backed FD) so that some simple > > user scripts can also send the events (these may not use ioctl, just > > write the events.) But this can be done afterwards anyway. > > > > I'd prefer we avoid this. It will break some of the semantics of the events > directory. One, only "user-events" will have this "marker" file. Yes, it is only for the user-events. > Although > it will be very similar to the "inject" file, but that's only for debugging > anyway. Oh, do we already the "inject" file? > All the files in the events directory is starting to add a bunch of > overhead, as they are typically copied into the instances. Although, when > we get the eventfs directory created, maybe that will not be as big of a > deal. But that still doesn't fix the semantics issue. Indeed. OK, making "marker" file for each instances may confuse user because the written event itself must be delivered to any instance but the file seems to belong to one of them. Please ignore it. Thank you, > > -- Steve >
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 3ee23f4d437f..deaaad421be4 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -725,6 +725,21 @@ config SYNTH_EVENTS If in doubt, say N. +config USER_EVENTS + bool "User trace events" + select TRACING + select DYNAMIC_EVENTS + default n + help + User trace events are user-defined trace events that + can be used like an existing kernel trace event. User trace + events are generated by writing to a tracefs file. User + processes can determine if their tracing events should be + generated by memory mapping a tracefs file and checking for + an associated byte being non-zero. + + If in doubt, say N. + config HIST_TRIGGERS bool "Histogram triggers" depends on ARCH_HAVE_NMI_SAFE_CMPXCHG diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index b1c47ccf4f73..a653b255e89c 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -80,6 +80,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o +obj-$(CONFIG_USER_EVENTS) += trace_events_user.o obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c new file mode 100644 index 000000000000..9afa72b55fa8 --- /dev/null +++ b/kernel/trace/trace_events_user.c @@ -0,0 +1,845 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2021, Microsoft Corporation. + * + * Authors: + * Beau Belgrave <beaub@linux.microsoft.com> + */ + +#include <linux/bitmap.h> +#include <linux/cdev.h> +#include <linux/hashtable.h> +#include <linux/io.h> +#include <linux/ioctl.h> +#include <linux/jhash.h> +#include <linux/trace_events.h> +#include <linux/tracefs.h> +#include <linux/types.h> +#include <linux/uaccess.h> +#include "trace.h" +#include "trace_dynevent.h" + +#define USER_EVENTS_SYSTEM "user_events" +#define USER_EVENTS_PREFIX "ue:" +#define USER_EVENTS_PREFIX_LEN (sizeof(USER_EVENTS_PREFIX)-1) + +/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */ +#define EVENT_BIT_FTRACE 0 +#define EVENT_BIT_PERF 1 +#define EVENT_BIT_OTHER 7 + +#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE) +#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF) +#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER) + +#define FIELD_DEPTH_TYPE 0 +#define FIELD_DEPTH_NAME 1 +#define FIELD_DEPTH_SIZE 2 +#define FIELD_DEPTH_OFFSET 3 + +/* + * Limits how many trace_event calls user processes can create: + * Must be multiple of PAGE_SIZE. + */ +#define MAX_PAGES 1 +#define MAX_EVENTS (MAX_PAGES * PAGE_SIZE) + +/* Limit how long of an event name plus args within the subsystem. */ +#define MAX_EVENT_DESC 512 +#define EVENT_NAME(user_event) ((user_event)->tracepoint.name) + +#define DIAG_IOC_MAGIC '*' +#define DIAG_IOCSREG _IOW(DIAG_IOC_MAGIC, 0, char*) +#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*) +#define DIAG_IOCQLOCOFFSET _IO(DIAG_IOC_MAGIC, 2) + +static char *register_page_data; + +static DEFINE_HASHTABLE(register_table, 4); +static DECLARE_BITMAP(page_bitmap, MAX_EVENTS); + +struct user_event { + struct tracepoint tracepoint; + struct trace_event_call call; + struct trace_event_class class; + struct dyn_event devent; + struct hlist_node node; + atomic_t refs; + int index; + char *args; +}; + +#ifdef CONFIG_PERF_EVENTS +struct user_bpf_context { + int udatalen; + const char __user *udata; +}; +#endif + +typedef void (*user_event_func_t) (struct user_event *user, + const char __user *udata, + size_t udatalen, void *tpdata); + +static int register_user_event(char *name, char *args, + struct user_event **newuser); + +/* + * Parses a register command for user_events + * Format: event_name[;field1;field2;...] + * + * Example event named test with a 20 char msg field at offset 0 with a unsigned int at offset 20: + * test;char[]\tmsg\t20\t0;unsigned int\tid\t4\t20; + * + * NOTE: Offsets are from the user data perspective, they are not from the + * trace_entry/buffer perspective. We automatically add the common properties + * sizes to the offset for the user. Types of __data_loc must trace a value that + * is offset by the value of the DIAG_IOCQLOCOFFSET ioctl to decode properly. + * This makes it easy for the common cases via the terminal, as only __data_loc + * types require an awareness by the user of the common property offsets. + */ +static int user_event_parse_cmd(char *raw_command, struct user_event **newuser) +{ + char *name = raw_command; + char *args = strpbrk(name, ";"); + + if (args) + *args++ = 0; + + return register_user_event(name, args, newuser); +} + +/* + * Parses the values of a field within the description + * Format: type\tname\tsize\toffset\t[future additions\t] + */ +static int user_event_parse_field(char *field, struct user_event *user) +{ + char *part, *type, *name; + u32 size, offset; + int depth = 0; + + while ((part = strsep(&field, "\t")) != NULL) { + switch (depth++) { + case FIELD_DEPTH_TYPE: + type = part; + break; + case FIELD_DEPTH_NAME: + name = part; + break; + case FIELD_DEPTH_SIZE: + if (kstrtou32(part, 10, &size)) + return -EINVAL; + break; + case FIELD_DEPTH_OFFSET: + if (kstrtou32(part, 10, &offset)) + return -EINVAL; + /* + * User does not know what trace_entry size is + * so we have to add to the offset. For data loc + * scenarios, user mode applications must be aware + * of this size when emitting the data location. + * The DIAG_IOCQLOCOFFSET ioctl can be used to get this. + */ + offset += sizeof(struct trace_entry); + break; + default: + /* Forward compatibility, ignore */ + goto end; + } + } +end: + if (depth < FIELD_DEPTH_OFFSET) + return -EINVAL; + + if (!strcmp(type, "print_fmt")) { + user->call.print_fmt = name; + return 0; + } + + return trace_define_field(&user->call, type, name, offset, size, + type[0] != 'u', FILTER_OTHER); +} + +/* + * Parses the fields that were described for the event + */ +static int user_event_parse_fields(struct user_event *user) +{ + char *field; + int ret = -EINVAL; + + while ((field = strsep(&user->args, ";")) != NULL) { + ret = user_event_parse_field(field, user); + + if (ret) + break; + } + + return ret; +} + +static int user_event_define_fields(struct trace_event_call *call) +{ + struct user_event *user = (struct user_event *)call->data; + + /* User chose to not disclose arguments */ + if (user->args == NULL) + return 0; + + return user_event_parse_fields(user); +} + +static struct trace_event_fields user_event_fields_array[] = { + { .type = TRACE_FUNCTION_TYPE, + .define_fields = user_event_define_fields }, + {} +}; + +static enum print_line_t user_event_print_trace(struct trace_iterator *iter, + int flags, + struct trace_event *event) +{ + /* Unsafe to try to decode user provided print_fmt, use hex */ + trace_print_hex_dump_seq(&iter->seq, "", DUMP_PREFIX_OFFSET, 16, + 1, iter->ent, iter->ent_size, true); + + return trace_handle_return(&iter->seq); +} + +static struct trace_event_functions user_event_funcs = { + .trace = user_event_print_trace, +}; + +static int destroy_user_event(struct user_event *user) +{ + int ret = 0; + + /* + * trace_remove_event_call invokes unregister_trace_event: + * Pick the correct one based on if we set the data or not + */ + if (user->index != 0) { + ret = trace_remove_event_call(&user->call); + + if (ret) + return ret; + + dyn_event_remove(&user->devent); + + register_page_data[user->index] = 0; + clear_bit(user->index, page_bitmap); + hash_del(&user->node); + } else { + unregister_trace_event(&user->call.event); + } + + kfree(EVENT_NAME(user)); + kfree(user); + + return ret; +} + +static struct user_event *find_user_event(u32 key, char *name) +{ + struct user_event *user; + + hash_for_each_possible(register_table, user, node, key) + if (!strcmp(EVENT_NAME(user), name)) + return user; + + return NULL; +} + +/* + * Writes the user supplied payload out to a trace file. + */ +static void user_event_ftrace(struct user_event *user, const char __user *udata, + size_t udatalen, void *tpdata) +{ + struct trace_event_file *file; + struct trace_entry *entry; + struct trace_event_buffer event_buffer; + + file = (struct trace_event_file *)tpdata; + + if (!file || + !(file->flags & EVENT_FILE_FL_ENABLED) || + trace_trigger_soft_disabled(file)) + return; + + entry = trace_event_buffer_reserve(&event_buffer, file, + sizeof(*entry) + udatalen); + + if (!entry) + return; + + if (!copy_from_user(entry + 1, udata, udatalen)) + trace_event_buffer_commit(&event_buffer); +} + +#ifdef CONFIG_PERF_EVENTS +/* + * Writes the user supplied payload out to perf ring buffer or eBPF program. + */ +static void user_event_perf(struct user_event *user, const char __user *udata, + size_t udatalen, void *tpdata) +{ + struct hlist_head *perf_head; + + if (bpf_prog_array_valid(&user->call)) { + struct user_bpf_context context = {0}; + + context.udatalen = udatalen; + context.udata = udata; + + trace_call_bpf(&user->call, &context); + } + + perf_head = this_cpu_ptr(user->call.perf_events); + + if (perf_head && !hlist_empty(perf_head)) { + struct trace_entry *perf_entry; + struct pt_regs *regs; + size_t size = sizeof(*perf_entry) + udatalen; + int context; + + perf_entry = perf_trace_buf_alloc(ALIGN(size, 8), + ®s, &context); + + if (!perf_entry) + return; + + perf_fetch_caller_regs(regs); + + if (copy_from_user(perf_entry + 1, + udata, + udatalen)) + return; + + perf_trace_buf_submit(perf_entry, size, context, + user->call.event.type, 1, regs, + perf_head, NULL); + } +} +#endif + +/* + * Update the register page that is shared between user processes. + */ +static void update_reg_page_for(struct user_event *user) +{ + struct tracepoint *tp = &user->tracepoint; + char status = 0; + + if (atomic_read(&tp->key.enabled) > 0) { + struct tracepoint_func *probe_func_ptr; + user_event_func_t probe_func; + + probe_func_ptr = rcu_dereference_sched(tp->funcs); + + if (probe_func_ptr) { + do { + probe_func = probe_func_ptr->func; + + if (probe_func == user_event_ftrace) + status |= EVENT_STATUS_FTRACE; +#ifdef CONFIG_PERF_EVENTS + else if (probe_func == user_event_perf) + status |= EVENT_STATUS_PERF; +#endif + else + status |= EVENT_STATUS_OTHER; + } while ((++probe_func_ptr)->func); + } + } + + register_page_data[user->index] = status; +} + +/* + * Register callback for our events from tracing sub-systems. + */ +static int user_event_reg(struct trace_event_call *call, + enum trace_reg type, + void *data) +{ + struct user_event *user = (struct user_event *)call->data; + int ret = 0; + + if (!user) + return -ENOENT; + + switch (type) { + case TRACE_REG_REGISTER: + ret = tracepoint_probe_register(call->tp, + call->class->probe, + data); + if (!ret) + goto inc; + break; + + case TRACE_REG_UNREGISTER: + tracepoint_probe_unregister(call->tp, + call->class->probe, + data); + goto dec; + +#ifdef CONFIG_PERF_EVENTS + case TRACE_REG_PERF_REGISTER: + ret = tracepoint_probe_register(call->tp, + call->class->perf_probe, + data); + if (!ret) + goto inc; + break; + + case TRACE_REG_PERF_UNREGISTER: + tracepoint_probe_unregister(call->tp, + call->class->perf_probe, + data); + goto dec; + + case TRACE_REG_PERF_OPEN: + case TRACE_REG_PERF_CLOSE: + case TRACE_REG_PERF_ADD: + case TRACE_REG_PERF_DEL: + break; +#endif + } + + return ret; +inc: + atomic_inc(&user->refs); + update_reg_page_for(user); + return 0; +dec: + update_reg_page_for(user); + atomic_dec(&user->refs); + return 0; +} + +static u32 user_event_key(char *name) +{ + return jhash(name, strlen(name), 0); +} + +static int user_event_create(const char *raw_command) +{ + struct user_event *user; + char *name; + int ret; + + if (strstr(raw_command, USER_EVENTS_PREFIX) != raw_command) + return -ECANCELED; + + name = kstrdup(raw_command + USER_EVENTS_PREFIX_LEN, GFP_KERNEL); + + if (!name) + return -ENOMEM; + + mutex_lock(&event_mutex); + ret = user_event_parse_cmd(name, &user); + mutex_unlock(&event_mutex); + + return ret; +} + +static int user_event_show(struct seq_file *m, struct dyn_event *ev) +{ + struct user_event *user = container_of(ev, struct user_event, devent); + struct ftrace_event_field *field, *next; + struct list_head *head; + char status; + + seq_printf(m, "%s%s", USER_EVENTS_PREFIX, EVENT_NAME(user)); + + head = trace_get_fields(&user->call); + + list_for_each_entry_safe(field, next, head, link) + seq_printf(m, ";%s %s", field->type, field->name); + + status = register_page_data[user->index]; + + if (status != 0) { + seq_puts(m, " (Used by"); + if (status & EVENT_STATUS_FTRACE) + seq_puts(m, " ftrace"); + if (status & EVENT_STATUS_PERF) + seq_puts(m, " perf"); + if (status & EVENT_STATUS_OTHER) + seq_puts(m, " other"); + seq_puts(m, ")"); + } + + seq_puts(m, "\n"); + + return 0; +} + +static bool user_event_is_busy(struct dyn_event *ev) +{ + struct user_event *user = container_of(ev, struct user_event, devent); + + return atomic_read(&user->refs) != 0; +} + +static int user_event_free(struct dyn_event *ev) +{ + struct user_event *user = container_of(ev, struct user_event, devent); + + return destroy_user_event(user); +} + +static bool user_event_match(const char *system, const char *event, + int argc, const char **argv, struct dyn_event *ev) +{ + struct user_event *user = container_of(ev, struct user_event, devent); + + return strcmp(EVENT_NAME(user), event) == 0 && + (!system || strcmp(system, USER_EVENTS_SYSTEM) == 0); +} + +static struct dyn_event_operations user_event_dops = { + .create = user_event_create, + .show = user_event_show, + .is_busy = user_event_is_busy, + .free = user_event_free, + .match = user_event_match, +}; + +/* + * Register a trace_event into the system, either find or create. + */ +static int register_user_event(char *name, char *args, + struct user_event **newuser) +{ + int ret; + int index; + u32 key = user_event_key(name); + struct user_event *user = find_user_event(key, name); + + if (user) { + *newuser = user; + ret = 0; + goto put_name; + } + + index = find_first_zero_bit(page_bitmap, MAX_EVENTS); + + if (index == MAX_EVENTS) { + ret = -EMFILE; + goto put_name; + } + + user = kzalloc(sizeof(*user), GFP_KERNEL); + + if (!user) { + ret = -ENOMEM; + goto put_name; + } + + INIT_LIST_HEAD(&user->class.fields); + + user->tracepoint.name = name; + user->args = args; + + user->call.data = user; + user->call.class = &user->class; + user->call.name = name; + user->call.flags = TRACE_EVENT_FL_TRACEPOINT; + user->call.tp = &user->tracepoint; + user->call.event.funcs = &user_event_funcs; + + user->class.system = USER_EVENTS_SYSTEM; + user->class.fields_array = user_event_fields_array; + user->class.reg = user_event_reg; + user->class.probe = user_event_ftrace; +#ifdef CONFIG_PERF_EVENTS + user->class.perf_probe = user_event_perf; +#endif + + ret = register_trace_event(&user->call.event); + + if (!ret) { + ret = -ENODEV; + goto put_user; + } + + ret = trace_add_event_call(&user->call); + + if (ret) { + destroy_user_event(user); + goto out; + } + + user->index = index; + dyn_event_init(&user->devent, &user_event_dops); + dyn_event_add(&user->devent); + set_bit(user->index, page_bitmap); + hash_add(register_table, &user->node, key); + + *newuser = user; + return 0; +put_user: + kfree(user); +put_name: + kfree(name); +out: + return ret; +} + +/* + * Deletes a previously created event if it is no longer being used. + */ +static int delete_user_event(char *name) +{ + u32 key = user_event_key(name); + struct user_event *user = find_user_event(key, name); + + if (!user) + return -ENOENT; + + if (atomic_read(&user->refs) != 0) + return -EBUSY; + + return destroy_user_event(user); +} + +/* + * Validates the user payload and writes to the appropriate sub-system. + */ +static ssize_t user_events_write(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + struct user_event *user; + struct tracepoint *tp; + + if (*ppos != 0 || count <= 0) + return -EFAULT; + + user = file->private_data; + + if (!user) + return -ENOENT; + + tp = &user->tracepoint; + + if (likely(atomic_read(&tp->key.enabled) > 0)) { + struct tracepoint_func *probe_func_ptr; + user_event_func_t probe_func; + void *tpdata; + + preempt_disable(); + + if (unlikely(!(cpu_online(raw_smp_processor_id())))) + goto preempt_out; + + probe_func_ptr = rcu_dereference_sched(tp->funcs); + + if (probe_func_ptr) { + do { + probe_func = probe_func_ptr->func; + tpdata = probe_func_ptr->data; + probe_func(user, ubuf, count, tpdata); + } while ((++probe_func_ptr)->func); + } +preempt_out: + preempt_enable(); + } + + return count; +} + +/* + * Handles the ioctl from user mode to register or alter operations. + */ +static long user_events_ioctl(struct file *file, unsigned int cmd, + unsigned long uarg) +{ + void __user *ubuf = (void __user *)uarg; + struct user_event *user; + char *name; + long ret; + + switch (cmd) { + case DIAG_IOCSREG: + /* Register/lookup on behalf of user process */ + name = strndup_user(ubuf, MAX_EVENT_DESC); + + if (IS_ERR(name)) { + ret = PTR_ERR(name); + goto out; + } + + mutex_lock(&event_mutex); + + if (file->private_data) { + /* Already associated with an event */ + ret = -EMFILE; + kfree(name); + goto reg_out; + } + + ret = user_event_parse_cmd(name, &user); + + if (!ret) { + file->private_data = user; + atomic_inc(&user->refs); + } +reg_out: + mutex_unlock(&event_mutex); + + if (ret < 0) + goto out; + + /* Return page index to check before writes */ + ret = user->index; + break; + + case DIAG_IOCSDEL: + /* Delete on behalf of user process */ + name = strndup_user(ubuf, MAX_EVENT_DESC); + + if (IS_ERR(name)) { + ret = PTR_ERR(name); + goto out; + } + + mutex_lock(&event_mutex); + ret = delete_user_event(name); + mutex_unlock(&event_mutex); + + kfree(name); + break; + + case DIAG_IOCQLOCOFFSET: + /* + * Return data offset to use for data locs. This enables + * user mode processes to query the common property sizes. + * If this was not known, the data location values written + * would be incorrect from the user mode side. + */ + ret = sizeof(struct trace_entry); + break; + + default: + ret = -ENOTTY; + break; + } +out: + return ret; +} + +/* + * Handles the final close of the file from user mode. + */ +static int user_events_release(struct inode *node, struct file *file) +{ + struct user_event *user = file->private_data; + + if (user) + atomic_dec(&user->refs); + + return 0; +} + +static const struct file_operations user_events_data_fops = { + .write = user_events_write, + .unlocked_ioctl = user_events_ioctl, + .release = user_events_release, +}; + +/* + * Maps the shared page into the user process for checking if event is enabled. + */ +static int user_events_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long size = vma->vm_end - vma->vm_start; + + if (size != MAX_EVENTS) + return -EINVAL; + + return remap_pfn_range(vma, vma->vm_start, + virt_to_phys(register_page_data) >> PAGE_SHIFT, + size, PAGE_READONLY); +} + +static const struct file_operations user_events_mmap_fops = { + .mmap = user_events_mmap, +}; + +/* + * Creates a set of tracefs files to allow user mode interactions. + */ +static int create_user_tracefs(void) +{ + struct dentry *edata, *emmap; + + edata = tracefs_create_file("user_events_data", 0644, NULL, + NULL, &user_events_data_fops); + + if (!edata) { + pr_warn("Could not create tracefs 'user_events_data' entry\n"); + goto err; + } + + /* mmap with MAP_SHARED requires writable fd */ + emmap = tracefs_create_file("user_events_mmap", 0644, NULL, + NULL, &user_events_mmap_fops); + + if (!emmap) { + tracefs_remove(edata); + pr_warn("Could not create tracefs 'user_events_mmap' entry\n"); + goto err; + } + + return 0; +err: + return -ENODEV; +} + +static void set_page_reservations(bool set) +{ + int page; + + for (page = 0; page < MAX_PAGES; ++page) { + void *addr = register_page_data + (PAGE_SIZE * page); + + if (set) + SetPageReserved(virt_to_page(addr)); + else + ClearPageReserved(virt_to_page(addr)); + } +} + +static int __init trace_events_user_init(void) +{ + int ret; + + /* Zero all bits beside 0 (which is reserved for failures) */ + bitmap_zero(page_bitmap, MAX_EVENTS); + set_bit(0, page_bitmap); + + register_page_data = kmalloc(MAX_EVENTS, GFP_KERNEL); + + if (!register_page_data) + return -ENOMEM; + + set_page_reservations(true); + + ret = create_user_tracefs(); + + if (ret) { + pr_warn("user_events could not register with tracefs\n"); + set_page_reservations(false); + kfree(register_page_data); + return ret; + } + + if (dyn_event_register(&user_event_dops)) + pr_warn("user_events could not register with dyn_events\n"); + + return 0; +} + +fs_initcall(trace_events_user_init);
User mode processes that wish to use trace events to get data into ftrace, perf, eBPF, etc are limited to uprobes today. The user events features enables an ABI for user mode processes to create and write to trace events that are isolated from kernel level trace events. This enables a faster path for tracing from user mode data as well as opens managed code to participate in trace events, where stub locations are dynamic. User processes often want to trace only when it's useful. To enable this a set of pages are mapped into the user process space that indicate the current state of the user events that have been registered. User processes can check if their event is hooked to a trace/probe, and if it is, emit the event data out via the write() syscall. Two files are added to tracefs to accomplish this: user_events_mmap - This file is mmap'd into participating user mode processes to indicate event status. user_events_data - This file is opened and register/delete ioctl's are issued to create/open/delete trace events that can be used for tracing. The typical scenario is on process start to mmap user_events_mmap. Processes then register the events they plan to use via the REG ioctl. The return value of the ioctl indicates the byte in the mmap to use for status. The file that was used for the ioctl is now accepting data via write() to emit out into the trace event. Psuedo code example of typical usage: page_fd = open("user_events_mmap", O_RDWR); page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0); data_fd = open("user_events_data", O_RDWR); data_id = ioctl(data_fd, DIAG_IOCSREG, "test"); if (page_data[data_id]) write(data_fd, &payload, sizeof(payload)); User events are also exposed via the dynamic_events tracefs file for both create, delete and current status. Simple example to register a user event via dynamic_events and get status: echo ue:test >> dynamic_events cat dynamic_events ue:test If an event is hooked to a probe, the probe hooked shows up: echo 1 > events/user_events/test/enable cat dynamic_events ue:test (Used by ftrace) Users can describe the trace event format via the following format: name[;field1;field2] Each field has the following format: type\tname\tsize\toffset Example for char array with a size of 20 named msg: echo -e 'ue:detailed;char[]\tmsg\t20\t0' >> dynamic_events cat dynamic_events ue:detailed;char[] msg Data offsets are based on the data written out via write() and will be updated to reflect the correct offset in the trace_event fields. __data_loc types must be aware of the size of trace_entry/common properties to ensure proper decoding. An ioctl is provided that enables user mode processes that only have access to user_events_data that returns the correct offset to use within the data payload (nothing needs to be done on registration). The above format is valid for both the ioctl and the dynamic_events file. Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> --- kernel/trace/Kconfig | 15 + kernel/trace/Makefile | 1 + kernel/trace/trace_events_user.c | 845 +++++++++++++++++++++++++++++++ 3 files changed, 861 insertions(+) create mode 100644 kernel/trace/trace_events_user.c