Message ID | 20220118204326.2169-1-beaub@linux.microsoft.com |
---|---|
Series | user_events: Enable user processes to create and write to trace events |
Hi Beau,

Thanks for updating. This series looks good to me.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

for this series.

Regards,

On Tue, 18 Jan 2022 12:43:14 -0800
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> User mode processes that wish to use trace events to get data into
> ftrace, perf, eBPF, etc. are limited to uprobes today. The user events
> feature enables an ABI for user mode processes to create and write to
> trace events that are isolated from kernel level trace events. This
> enables a faster path for tracing from user mode data as well as opens
> managed code to participate in trace events, where stub locations are
> dynamic.
>
> User processes often want to trace only when it's useful. To enable this,
> a set of pages is mapped into the user process space that indicates the
> current state of the user events that have been registered. User
> processes can check if their event is hooked to a trace/probe, and if it
> is, emit the event data out via the write() syscall.
>
> Two new files are introduced into tracefs to accomplish this:
> user_events_status - This file is mmap'd into participating user mode
> processes to indicate event status.
>
> user_events_data - This file is opened and register/delete ioctls are
> issued to create/open/delete trace events that can be used for tracing.
>
> In the typical scenario, a process mmaps user_events_status on start. It
> then registers the events it plans to use via the REG ioctl. The ioctl reads
> and updates the passed-in user_reg struct. The status_index of the struct
> identifies the byte in the status page to check for that event. The
> write_index of the struct is used to describe that event when writing out to
> the fd that was used for the ioctl call. The data must always include this
> index first when writing out data for an event. Data can be written either by
> write() or by writev().
>
> For example, in memory:
> int index;
> char data[];
>
> Pseudo code example of typical usage:
> struct user_reg reg;
>
> int page_fd = open("user_events_status", O_RDWR);
> char *page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
> close(page_fd);
>
> int data_fd = open("user_events_data", O_RDWR);
>
> reg.size = sizeof(reg);
> reg.name_args = (__u64)(uintptr_t)"test";
>
> ioctl(data_fd, DIAG_IOCSREG, &reg);
> int status_id = reg.status_index;
> int write_id = reg.write_index;
>
> struct iovec io[2];
> io[0].iov_base = &write_id;
> io[0].iov_len = sizeof(write_id);
> io[1].iov_base = payload;
> io[1].iov_len = sizeof(payload);
>
> if (page_data[status_id])
>     writev(data_fd, io, 2);
>
> User events are also exposed via the dynamic_events tracefs file for
> both create and delete. Current status is exposed via the user_events_status
> tracefs file.
>
> Simple example to register a user event via dynamic_events:
>     echo u:test >> dynamic_events
>     cat dynamic_events
> u:test
>
> If an event is hooked to a probe, the attached probe shows up:
>     echo 1 > events/user_events/test/enable
>     cat user_events_status
> 1:test # Used by ftrace
>
> Active: 1
> Busy: 1
> Max: 4096
>
> If an event is not hooked to a probe, no probe status shows up:
>     echo 0 > events/user_events/test/enable
>     cat user_events_status
> 1:test
>
> Active: 1
> Busy: 0
> Max: 4096
>
> Users can describe the trace event format using the following syntax:
>     name[:FLAG1[,FLAG2...]] [field1[;field2...]]
>
> Each field has the following format:
>     type name
>
> Example for a char array with a size of 20 named msg:
>     echo 'u:detailed char[20] msg' >> dynamic_events
>     cat dynamic_events
> u:detailed char[20] msg
>
> Data offsets are based on the data written out via write() and will be
> updated to reflect the correct offset in the trace_event fields. For dynamic
> data it is recommended to use the new __rel_loc data type. This type will be
> the same as __data_loc, but the offset is relative to this entry.
> This allows user_events to not worry about what common fields are being
> inserted before the data.
>
> The above format is valid for both the ioctl and the dynamic_events file.
>
> V2:
> Fixed kmalloc vs kzalloc for register_page.
> Renamed user_event_mmap to user_event_status.
> Renamed user_event prefix from ue to u.
> Added seq_* operations to user_event_status to enable cat output.
> Aligned field parsing to synth_events format (+ size specifier for
> custom/user types).
> Added uapi header user_events.h to align kernel and user ABI definitions.
>
> V3:
> Updated ABI to handle single FD into many events via an int header.
> Added iovec/writev support to enable int header without payload changes.
> Updated bpf context to describe if data is coming from user, kernel or
> raw iovec.
> Added flag support for registering event, allows forcing BPF to always
> receive the direct iovecs for sensitive code paths that do not want
> copies.
>
> V4:
> Moved to struct user_reg for registering events via ioctl.
> Added unit tests for ftrace, dyn_events and perf integration.
> Added print_fmt generation and proper dyn_events matching statements.
> Reduced time in preemption disabled paths.
> Added documentation file.
> Pre-fault in data when preemption is enabled and use no-fault copy in probes.
> Fixed MIPS missing PAGE_READONLY define.
>
> V5:
> Rebase to linux-trace for-next branch.
> Added sample code into samples/user_events.
> Switched to str_has_prefix in various locations.
> Allow hex in array sizes and ensure reasonable sizes are used.
> Moved lifetime of name buffer when parsing to the caller for failure paths.
> Fixed documentation nits and index.
> Ensure event isn't busy before freeing through dyn_events.
> Properly handle failure case for ftrace and perf in fault cases for buffers.
> Ensure write data is over min size and null terminated for dynamic arrays.
>
> V6:
> Fixed endian issue with dyn loc decoding (use u32).
> Fixed size_t conversion warning on hexagon arch (min vs min_t).
> Handle cases for __get_str vs __get_rel_str in print_fmt generation.
> Add additional comments around various event member lifetimes.
> Reduced max field array size to 1K.
>
> V7:
> Acquire reg_mutex during release, ensure refs cannot change under any situation.
> Remove default n from Kconfig.
> Move from static 0644 mode to TRACE_MODE_WRITE.
>
> V8:
> Squashed UABI header into ftrace minimal patch thread.
> Moved pagefault_disable/enable into copy_nofault.
> Moved to strscpy vs custom copy when getting array size from type.
> Made patch bisect friendly by ensuring tests are split from kernel code.
>
> V9:
> Rebase to linux-trace ftrace/core branch.
> Added comments for user_reg and other structs in user_events.h.
> Moved from delayed seq_file to pre-created seq_file for status file.
> Added deleting events to documentation and expanded registering section.
> Reordered patches to make reviewing easier.
> Fixed nitpicks.
>
> V10:
> Fix struct size case not writing size out to dynamic_events.
> Fix warning for NULL pointer arithmetic in user_seq_start.
>
> Beau Belgrave (12):
>   user_events: Add minimal support for trace_event into ftrace
>   user_events: Add print_fmt generation support for basic types
>   user_events: Handle matching arguments from dyn_events
>   user_events: Add basic perf and eBPF support
>   user_events: Optimize writing events by only copying data once
>   user_events: Validate user payloads for size and null termination
>   user_events: Add self-test for ftrace integration
>   user_events: Add self-test for dynamic_events integration
>   user_events: Add self-test for perf_event integration
>   user_events: Add self-test for validator boundaries
>   user_events: Add sample code for typical usage
>   user_events: Add documentation file
>
>  Documentation/trace/index.rst                 |    1 +
>  Documentation/trace/user_events.rst           |  216 +++
>  include/uapi/linux/user_events.h              |  116 ++
>  kernel/trace/Kconfig                          |   14 +
>  kernel/trace/Makefile                         |    1 +
>  kernel/trace/trace_events_user.c              | 1617 +++++++++++++++++
>  samples/user_events/Makefile                  |    5 +
>  samples/user_events/example.c                 |   91 +
>  tools/testing/selftests/user_events/Makefile  |    9 +
>  .../testing/selftests/user_events/dyn_test.c  |  130 ++
>  .../selftests/user_events/ftrace_test.c       |  452 +++++
>  .../testing/selftests/user_events/perf_test.c |  168 ++
>  tools/testing/selftests/user_events/settings  |    1 +
>  13 files changed, 2821 insertions(+)
>  create mode 100644 Documentation/trace/user_events.rst
>  create mode 100644 include/uapi/linux/user_events.h
>  create mode 100644 kernel/trace/trace_events_user.c
>  create mode 100644 samples/user_events/Makefile
>  create mode 100644 samples/user_events/example.c
>  create mode 100644 tools/testing/selftests/user_events/Makefile
>  create mode 100644 tools/testing/selftests/user_events/dyn_test.c
>  create mode 100644 tools/testing/selftests/user_events/ftrace_test.c
>  create mode 100644 tools/testing/selftests/user_events/perf_test.c
>  create mode 100644 tools/testing/selftests/user_events/settings
>
>
> base-commit: 85c62c8c3749eec02ba81217bdcac26867dc262e
> --
> 2.17.1
>
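The write path described in the cover letter (check the event's byte in the status page, then send the write_index followed by the payload in a single writev()) can be sketched in C as below. This is a minimal, self-contained sketch under stated assumptions: a plain byte array stands in for the mmap'd user_events_status page and a pipe stands in for the user_events_data fd, so it runs without tracefs. emit_if_enabled() is a hypothetical helper name, not part of the series.

```c
#include <assert.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: emit one event only if its status byte is set.
 * status_page stands in for the mmap'd user_events_status page and fd
 * stands in for the user_events_data fd (assumptions for this sketch).
 * Returns bytes written, 0 if the event is off, or -1 on writev() error.
 */
static ssize_t emit_if_enabled(const volatile char *status_page, int status_id,
                               int fd, int write_id,
                               const void *payload, size_t payload_len)
{
    struct iovec io[2];

    assert(status_id >= 0);

    /* A zero byte means nothing (ftrace, perf, eBPF) is attached: skip. */
    if (!status_page[status_id])
        return 0;

    io[0].iov_base = &write_id;        /* write_index must come first */
    io[0].iov_len = sizeof(write_id);
    io[1].iov_base = (void *)payload;  /* iov_base is not const-qualified */
    io[1].iov_len = payload_len;

    /* One syscall; the kernel gathers the header and payload buffers. */
    return writev(fd, io, 2);
}
```

Against a real kernel, status_id and write_id would come from the status_index and write_index fields filled in by the DIAG_IOCSREG ioctl, as in the pseudo code in the cover letter. Using writev() rather than write() means the payload never has to be copied behind the int header, which matches the V3 changelog note about iovec/writev support.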
On Wed, Jan 19, 2022 at 05:32:03PM +0900, Masami Hiramatsu wrote:
> Hi Beau,
>
> Thanks for updating. This series looks good to me.
>
> Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
>
> for this series.
>
> Regards,

Great! Thank you for all your time on this, I appreciate it!

-Beau
On Tue, 18 Jan 2022 12:43:14 -0800
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> User mode processes that wish to use trace events to get data into
> ftrace, perf, eBPF, etc. are limited to uprobes today. The user events
> feature enables an ABI for user mode processes to create and write to
> trace events that are isolated from kernel level trace events. This
> enables a faster path for tracing from user mode data as well as opens
> managed code to participate in trace events, where stub locations are
> dynamic.

So I finished my review, and I've added it to my queue that I'm
running through my tests.

Before I accept it though, I would really like you to send patches to
linux-trace-devel@vger.kernel.org that add an API to libtracefs:

  https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/

Something where users do not need to know about ioctls, or iovecs, etc.

struct tracefs_user_event *
tracefs_user_event_register(const char *name,
                            enum tracefs_uevent_type type,
                            char *field, ...);

Where tracefs_uevent_type can be:

enum tracefs_uevent_type {
        TRACEFS_UEVENT_END,
        TRACEFS_UEVENT_u8,
        TRACEFS_UEVENT_s8,
        TRACEFS_UEVENT_u16,
        ...
};

        uevent = tracefs_user_event_register("test",
                                TRACEFS_UEVENT_u64, "count",
                                TRACEFS_UEVENT_string, "name",
                                TRACEFS_UEVENT_array, 16, "array",
                                TRACEFS_UEVENT_END);

and that will do the ioctl to register the event, with the given types
and fields.

        struct tracefs_user_event_status *ustatus;

        ustatus = tracefs_user_event_status(); // does the mmap.

Then we could also have:

        if (tracefs_user_event_test(ustatus, uevent))
                tracefs_user_event_write(uevent, 64, "string", { 16 byte data });

The ustatus will be the mmap and the uevent will have the information
to know where on the mmap to test for the event.

As for the write, the types are saved, and the write function will have
variable arguments defined by the tracefs_user_event_register().

I think having that interface in libtracefs would make this easy to
use for everyone.

-- Steve
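One piece of the proposed register call can be sketched without kernel support: turning the (type, field) argument list into the `name type field;type field` description text that the cover letter says both the ioctl and dynamic_events accept. Everything here is hypothetical. The enum only mirrors the one proposed above (with a few assumed extra members), build_user_event_desc() is an illustrative name rather than a libtracefs function, and the type spellings are assumptions, including mapping a string to `__rel_loc char[]` (following the cover letter's recommendation for dynamic data) and an array to a byte array `char[N]`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type tags modeled on the proposed enum (not a real API). */
enum tracefs_uevent_type {
    TRACEFS_UEVENT_END,
    TRACEFS_UEVENT_u8,
    TRACEFS_UEVENT_s8,
    TRACEFS_UEVENT_u16,
    TRACEFS_UEVENT_s16,
    TRACEFS_UEVENT_u32,
    TRACEFS_UEVENT_s32,
    TRACEFS_UEVENT_u64,
    TRACEFS_UEVENT_s64,
    TRACEFS_UEVENT_string,
    TRACEFS_UEVENT_array,
};

/* Map a tag to its type text in the description (assumed spellings). */
static const char *uevent_type_name(int type)
{
    switch (type) {
    case TRACEFS_UEVENT_u8:     return "u8";
    case TRACEFS_UEVENT_s8:     return "s8";
    case TRACEFS_UEVENT_u16:    return "u16";
    case TRACEFS_UEVENT_s16:    return "s16";
    case TRACEFS_UEVENT_u32:    return "u32";
    case TRACEFS_UEVENT_s32:    return "s32";
    case TRACEFS_UEVENT_u64:    return "u64";
    case TRACEFS_UEVENT_s64:    return "s64";
    case TRACEFS_UEVENT_string: return "__rel_loc char[]";
    case TRACEFS_UEVENT_array:  return "char";  /* assumed byte array */
    default:                    return NULL;
    }
}

/*
 * Hypothetical helper: build the text a register call could pass in
 * user_reg.name_args. Variadic arguments are (type, field) pairs, with an
 * extra int length before the field for TRACEFS_UEVENT_array, terminated
 * by TRACEFS_UEVENT_END. Returns 0, or -1 on unknown type or short buffer.
 */
static int build_user_event_desc(char *buf, size_t len, const char *name, ...)
{
    va_list ap;
    size_t used;
    const char *sep = " ";  /* first field follows the name after a space */
    int n, ret = 0;

    assert(buf != NULL && name != NULL);
    n = snprintf(buf, len, "%s", name);
    if (n < 0 || (size_t)n >= len)
        return -1;
    used = (size_t)n;

    va_start(ap, name);
    for (;;) {
        int type = va_arg(ap, int);  /* enum constants are passed as int */
        int size = 0;
        const char *field, *ctype;

        if (type == TRACEFS_UEVENT_END)
            break;
        if (type == TRACEFS_UEVENT_array)
            size = va_arg(ap, int);  /* array length precedes the name */
        field = va_arg(ap, char *);
        ctype = uevent_type_name(type);
        if (!ctype) {
            ret = -1;
            break;
        }

        if (type == TRACEFS_UEVENT_array)
            n = snprintf(buf + used, len - used, "%s%s[%d] %s",
                         sep, ctype, size, field);
        else
            n = snprintf(buf + used, len - used, "%s%s %s", sep, ctype, field);
        if (n < 0 || (size_t)n >= len - used) {
            ret = -1;
            break;
        }
        used += (size_t)n;
        sep = ";";  /* later fields are separated by ';' */
    }
    va_end(ap);
    return ret;
}
```

With the "test" registration proposed above, this sketch produces `test u64 count;__rel_loc char[] name;char[16] array`, which a real implementation would then hand to the REG ioctl.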
On Thu, Feb 10, 2022 at 11:00:21PM -0500, Steven Rostedt wrote:
> On Tue, 18 Jan 2022 12:43:14 -0800
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
>
> > User mode processes that wish to use trace events to get data into
> > ftrace, perf, eBPF, etc. are limited to uprobes today. The user events
> > feature enables an ABI for user mode processes to create and write to
> > trace events that are isolated from kernel level trace events. This
> > enables a faster path for tracing from user mode data as well as opens
> > managed code to participate in trace events, where stub locations are
> > dynamic.
>
> So I finished my review, and I've added it to my queue that I'm
> running through my tests.
>
> Before I accept it though, I would really like you to send patches to
> linux-trace-devel@vger.kernel.org that add an API to libtracefs:
>
>   https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
>
> Something where users do not need to know about ioctls, or iovecs, etc.
>
> struct tracefs_user_event *
> tracefs_user_event_register(const char *name,
>                             enum tracefs_uevent_type type,
>                             char *field, ...);
>
> Where tracefs_uevent_type can be:
>
> enum tracefs_uevent_type {
>         TRACEFS_UEVENT_END,
>         TRACEFS_UEVENT_u8,
>         TRACEFS_UEVENT_s8,
>         TRACEFS_UEVENT_u16,
>         ...
> };
>
>         uevent = tracefs_user_event_register("test",
>                                 TRACEFS_UEVENT_u64, "count",
>                                 TRACEFS_UEVENT_string, "name",
>                                 TRACEFS_UEVENT_array, 16, "array",
>                                 TRACEFS_UEVENT_END);
>
> and that will do the ioctl to register the event, with the given types
> and fields.
>
>         struct tracefs_user_event_status *ustatus;
>
>         ustatus = tracefs_user_event_status(); // does the mmap.
>
> Then we could also have:
>
>         if (tracefs_user_event_test(ustatus, uevent))
>                 tracefs_user_event_write(uevent, 64, "string", { 16 byte data });
>
> The ustatus will be the mmap and the uevent will have the information
> to know where on the mmap to test for the event.
>
> As for the write, the types are saved, and the write function will have
> variable arguments defined by the tracefs_user_event_register().
>
> I think having that interface in libtracefs would make this easy to
> use for everyone.

Agreed, I'll get going on this. For performance, I'm likely going to play
around with the shape of the API, but it will look similar to the above.

Thanks,
-Beau
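The write side described in the thread ("the types are saved, and the write function will have variable arguments") could look roughly like the packer below, which serializes the write_index and then one variadic argument per saved field. This is a sketch under loudly stated assumptions: struct uevent_sketch and pack_user_event() are hypothetical names; fields are assumed to be packed back to back in declaration order, in host byte order; and a string is assumed to be stored as a 32-bit __rel_loc word holding the data length in its high 16 bits and the offset from the end of that word in its low 16 bits, with the NUL-terminated bytes appended after the fixed fields.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical saved field types (a small subset, for brevity). */
enum uevent_kind { UEV_END, UEV_U32, UEV_U64, UEV_STRING };

/* Hypothetical record a register call could keep for later writes. */
struct uevent_sketch {
    int write_index;            /* would come from user_reg.write_index */
    enum uevent_kind kinds[8];  /* at most 7 fields, then UEV_END (0) */
};

/* Bytes a field occupies in the fixed part of the payload. */
static size_t fixed_size(enum uevent_kind k)
{
    switch (k) {
    case UEV_U32:    return 4;
    case UEV_U64:    return 8;
    case UEV_STRING: return 4;  /* the __rel_loc word; text goes at the tail */
    default:         return 0;
    }
}

/*
 * Pack [int write_index][fixed fields][string bytes] into buf, consuming one
 * variadic argument per saved field: uint32_t for UEV_U32, uint64_t for
 * UEV_U64 (callers must cast, e.g. (uint64_t)64), char * for UEV_STRING.
 * Returns the payload length, or -1 if it does not fit.
 */
static long pack_user_event(const struct uevent_sketch *ev,
                            unsigned char *buf, size_t len, ...)
{
    size_t pos = sizeof(int), tail = sizeof(int);
    va_list ap;
    int i;

    assert(ev != NULL && buf != NULL);

    /* First pass: the fixed area size tells us where string data begins. */
    for (i = 0; ev->kinds[i] != UEV_END; i++)
        tail += fixed_size(ev->kinds[i]);
    if (tail > len)
        return -1;
    memcpy(buf, &ev->write_index, sizeof(int));  /* index always first */

    va_start(ap, len);
    for (i = 0; ev->kinds[i] != UEV_END; i++) {
        if (ev->kinds[i] == UEV_U32) {
            uint32_t v = va_arg(ap, uint32_t);
            memcpy(buf + pos, &v, sizeof(v));
        } else if (ev->kinds[i] == UEV_U64) {
            uint64_t v = va_arg(ap, uint64_t);
            memcpy(buf + pos, &v, sizeof(v));
        } else {
            const char *s = va_arg(ap, char *);
            size_t slen = strlen(s) + 1;    /* keep the NUL terminator */
            size_t off = tail - (pos + 4);  /* measured from end of loc word */
            uint32_t loc;

            if (tail + slen > len || slen > 0xffff || off > 0xffff) {
                va_end(ap);
                return -1;
            }
            loc = (uint32_t)(slen << 16) | (uint32_t)off;
            memcpy(buf + pos, &loc, sizeof(loc));
            memcpy(buf + tail, s, slen);
            tail += slen;
        }
        pos += fixed_size(ev->kinds[i]);
    }
    va_end(ap);
    return (long)tail;
}
```

Copying everything into one buffer keeps the sketch simple. The iovec path from the cover letter (V3: "Added iovec/writev support to enable int header without payload changes") avoids that copy, which may matter for the performance concerns raised in the reply above.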