[v8,09/12] user_events: Add documentation file

Message ID 20211216173511.10390-10-beaub@linux.microsoft.com (mailing list archive)
State Superseded
Series user_events: Enable user processes to create and write to trace events

Commit Message

Beau Belgrave Dec. 16, 2021, 5:35 p.m. UTC
Add a documentation file about user_events with example code, etc.
explaining how it may be used.

Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
---
 Documentation/trace/index.rst       |   1 +
 Documentation/trace/user_events.rst | 195 ++++++++++++++++++++++++++++
 2 files changed, 196 insertions(+)
 create mode 100644 Documentation/trace/user_events.rst

Comments

Masami Hiramatsu (Google) Dec. 22, 2021, 2:18 p.m. UTC | #1
Hi Beau,

On Thu, 16 Dec 2021 09:35:08 -0800
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> Add a documentation file about user_events with example code, etc.
> explaining how it may be used.
> 
> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
> ---
>  Documentation/trace/index.rst       |   1 +
>  Documentation/trace/user_events.rst | 195 ++++++++++++++++++++++++++++
>  2 files changed, 196 insertions(+)
>  create mode 100644 Documentation/trace/user_events.rst
> 
> diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
> index 3769b9b7aed8..3a47aa8341c6 100644
> --- a/Documentation/trace/index.rst
> +++ b/Documentation/trace/index.rst
> @@ -30,3 +30,4 @@ Linux Tracing Technologies
>     stm
>     sys-t
>     coresight/index
> +   user_events
> diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
> new file mode 100644
> index 000000000000..36104b537476
> --- /dev/null
> +++ b/Documentation/trace/user_events.rst
> @@ -0,0 +1,195 @@
> +=========================================
> +user_events: User-based Event Tracing
> +=========================================
> +
> +:Author: Beau Belgrave
> +
> +Overview
> +--------
> +User based trace events allow user processes to create events and trace data
> +that can be viewed via existing tools, such as ftrace, perf and eBPF.
> +To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.
> +
> +Programs can view status of the events via
> +/sys/kernel/debug/tracing/user_events_status and can both register and write
> +data out via /sys/kernel/debug/tracing/user_events_data.
> +
> +Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and
> +delete user based events via the u: prefix. The format of the command to
> +dynamic_events is the same as the ioctl with the u: prefix applied.
> +
> +Typically programs will register a set of events that they wish to expose to
> +tools that can read trace_events (such as ftrace and perf). The registration
> +process gives back two ints to the program for each event. The first int is the
> +status index. This index describes which byte in the
> +/sys/kernel/debug/tracing/user_events_status file represents this event. The
> +second int is the write index. This index describes the data when a write() or
> +writev() is called on the /sys/kernel/debug/tracing/user_events_data file.
> +
> +The structures referenced in this document are contained within the
> +/include/uapi/linux/user_events.h file in the source tree.
> +
> +**NOTE:** *Both user_events_status and user_events_data are under the tracefs
> +filesystem and may be mounted at different paths than above.*
> +
> +Registering
> +-----------
> +Registering within a user process is done via ioctl() out to the
> +/sys/kernel/debug/tracing/user_events_data file. The command to issue is
> +DIAG_IOCSREG. This command takes a struct user_reg as an argument.
> +

Could you add the user_reg data structure here?

> +The struct user_reg requires two values, the first is the size of the structure
> +to ensure forward and backward compatibility. The second is the command string
> +to issue for registering.

This explanation may be a bit out of date? 
user_reg has 4 fields. 2 for input, 2 for output.

And could you add a section for DIAG_IOCSDEL?

> +
> +User based events show up under tracefs like any other event under the
> +subsystem named "user_events". This means tools that wish to attach to the
> +events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
> +or perf record -e user_events:[name] when attaching/recording.
> +
> +**NOTE:** *The write_index returned is only valid for the FD that was used*
> +
> +Command Format
> +^^^^^^^^^^^^^^
> +The command string format is as follows::
> +
> +  name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]
> +
> +Supported Flags
> +^^^^^^^^^^^^^^^
> +**BPF_ITER** - EBPF programs attached to this event will get the raw iovec
> +struct instead of any data copies for max performance.
> +
> +Field Format
> +^^^^^^^^^^^^
> +::
> +
> +  type name [size]
> +
> +Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
> +User programs are encouraged to use clearly sized types like u32.
> +
> +**NOTE:** *Long is not supported since size can vary between user and kernel.*
> +
> +The size is only valid for types that start with a struct prefix.
> +This allows user programs to describe custom structs out to tools, if required.
> +
> +For example, a struct in C that looks like this::
> +
> +  struct mytype {
> +    char data[20];
> +  };
> +
> +Would be represented by the following field::
> +
> +  struct mytype myname 20
> +
> +Status
> +------
> +When tools attach/record user based events the status of the event is updated
> +in realtime. This allows user programs to only incur the cost of the write() or
> +writev() calls when something is actively attached to the event.
> +
> +User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to
> +check the status for each event that is registered. The byte to check in the
> +file is given back after the register ioctl() via user_reg.status_index.
> +Currently the size of user_events_status is a single page; however, custom
> +kernel configurations can change this size to allow more user based events. In
> +all cases the size of the file is a multiple of a page size.
> +
> +For example, if the register ioctl() gives back a status_index of 3 you would
> +check byte 3 of the returned mmap data to see if anything is attached to that
> +event.
> +
> +Administrators can easily check the status of all registered events by reading
> +the user_events_status file directly via a terminal. The output is as follows::
> +
> +  Byte:Name [# Comments]
> +  ...
> +
> +  Active: ActiveCount
> +  Busy: BusyCount
> +  Max: MaxCount
> +
> +For example, on a system that has a single event the output looks like this::
> +
> +  1:test
> +
> +  Active: 1
> +  Busy: 0
> +  Max: 4096
> +
> +If a user enables the user event via ftrace, the output would change to this::
> +
> +  1:test # Used by ftrace
> +
> +  Active: 1
> +  Busy: 1
> +  Max: 4096
> +
> +**NOTE:** *A status index of 0 will never be returned. This allows user
> +programs to have an index that can be used on error cases.*
> +
> +Status Bits
> +^^^^^^^^^^^
> +The byte being checked will be non-zero if anything is attached. Programs can
> +check specific bits in the byte to see what mechanism has been attached.
> +
> +The following values are defined to aid in checking what has been attached:
> +
> +**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0).
> +
> +**EVENT_STATUS_PERF** - Bit set if perf/eBPF has been attached (Bit 1).
> +
> +Writing Data
> +------------
> +After registering an event the same fd that was used to register can be used
> +to write an entry for that event. The write_index returned must be at the start
> +of the data, then the remaining data is treated as the payload of the event.
> +
> +For example, if the write_index returned was 1 and I wanted to write out an int
> +payload for the event, then the data would have to be 8 bytes (2 ints) in size,
> +with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
> +value I want as the payload.
> +
> +In memory this would look like this::
> +
> +  int index;
> +  int payload;
> +
> +User programs might have well known structs that they wish to use to emit out
> +as payloads. In those cases writev() can be used, with the first vector being
> +the index and the following vector(s) being the actual event payload.
> +
> +For example, if I have a struct like this::
> +
> +  struct payload {
> +        int src;
> +        int dst;
> +        int flags;
> +  };
> +
> +It's advised for user programs to do the following::
> +
> +  struct iovec io[2];
> +  struct payload e;
> +
> +  io[0].iov_base = &write_index;
> +  io[0].iov_len = sizeof(write_index);
> +  io[1].iov_base = &e;
> +  io[1].iov_len = sizeof(e);
> +
> +  writev(fd, (const struct iovec*)io, 2);
> +
> +**NOTE:** *The write_index is not emitted out into the trace being recorded.*
> +
> +EBPF
> +----
> +EBPF programs that attach to a user-based event tracepoint are given a pointer
> +to a struct user_bpf_context. The bpf context contains the data type (which can
> +be a user or kernel buffer, or can be a pointer to the iovec) and the data
> +length that was emitted (minus the write_index).
> +
> +Example Code
> +------------
> +See sample code in samples/user_events.

Maybe tools/testing/selftests/user_events ?

Thank you,


> -- 
> 2.17.1
>

Beau Belgrave Jan. 3, 2022, 11:01 p.m. UTC | #2
On Wed, Dec 22, 2021 at 11:18:34PM +0900, Masami Hiramatsu wrote:
> Hi Beau,
> 
> On Thu, 16 Dec 2021 09:35:08 -0800
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
> 
> > Add a documentation file about user_events with example code, etc.
> > explaining how it may be used.
> > 
> > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
> > ---
> >  Documentation/trace/index.rst       |   1 +
> >  Documentation/trace/user_events.rst | 195 ++++++++++++++++++++++++++++
> >  2 files changed, 196 insertions(+)
> >  create mode 100644 Documentation/trace/user_events.rst
> > 
> > diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
> > index 3769b9b7aed8..3a47aa8341c6 100644
> > --- a/Documentation/trace/index.rst
> > +++ b/Documentation/trace/index.rst
> > @@ -30,3 +30,4 @@ Linux Tracing Technologies
> >     stm
> >     sys-t
> >     coresight/index
> > +   user_events
> > diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
> > new file mode 100644
> > index 000000000000..36104b537476
> > --- /dev/null
> > +++ b/Documentation/trace/user_events.rst
> > @@ -0,0 +1,195 @@
> > +=========================================
> > +user_events: User-based Event Tracing
> > +=========================================

[..]

> > +Registering
> > +-----------
> > +Registering within a user process is done via ioctl() out to the
> > +/sys/kernel/debug/tracing/user_events_data file. The command to issue is
> > +DIAG_IOCSREG. This command takes a struct user_reg as an argument.
> > +
> 
> Could you add the user_reg data structure here?
> 

Sure thing.

> > +The struct user_reg requires two values, the first is the size of the structure
> > +to ensure forward and backward compatibility. The second is the command string
> > +to issue for registering.
> 
> This explanation may be a bit out of date? 
> user_reg has 4 fields. 2 for input, 2 for output.
> 

Yeah, it only requires 2 inputs to work. I'll try to make this clearer.

> And could you add a section for DIAG_IOCSDEL?
> 

Sure thing.

> > +
> > +User based events show up under tracefs like any other event under the
> > +subsystem named "user_events". This means tools that wish to attach to the
> > +events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
> > +or perf record -e user_events:[name] when attaching/recording.
> > +
> > +**NOTE:** *The write_index returned is only valid for the FD that was used*
> > +

[..]

> > +Example Code
> > +------------
> > +See sample code in samples/user_events.
> 
> Maybe tools/testing/selftests/user_events ?
> 

Previously I was asked to put the sample code in samples/user_events, so
it lives there.

Thanks,
-Beau

Steven Rostedt Jan. 6, 2022, 9:14 p.m. UTC | #3
On Mon, 3 Jan 2022 15:01:39 -0800
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> > > +Example Code
> > > +------------
> > > +See sample code in samples/user_events.  
> > 
> > Maybe tools/testing/selftests/user_events ?
> >   
> 
> Previously I was asked to put the sample code in samples/user_events, so
> it lives there.

Yes, please keep the sample code in samples. Even if we have duplicate code
in selftests, that should not be used for sample code.

-- Steve

Patch

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 3769b9b7aed8..3a47aa8341c6 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -30,3 +30,4 @@  Linux Tracing Technologies
    stm
    sys-t
    coresight/index
+   user_events
diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
new file mode 100644
index 000000000000..36104b537476
--- /dev/null
+++ b/Documentation/trace/user_events.rst
@@ -0,0 +1,195 @@ 
+=========================================
+user_events: User-based Event Tracing
+=========================================
+
+:Author: Beau Belgrave
+
+Overview
+--------
+User based trace events allow user processes to create events and trace data
+that can be viewed via existing tools, such as ftrace, perf and eBPF.
+To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.
+
+Programs can view status of the events via
+/sys/kernel/debug/tracing/user_events_status and can both register and write
+data out via /sys/kernel/debug/tracing/user_events_data.
+
+Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and
+delete user based events via the u: prefix. The format of the command to
+dynamic_events is the same as the ioctl with the u: prefix applied.
+
+Typically programs will register a set of events that they wish to expose to
+tools that can read trace_events (such as ftrace and perf). The registration
+process gives back two ints to the program for each event. The first int is the
+status index. This index describes which byte in the
+/sys/kernel/debug/tracing/user_events_status file represents this event. The
+second int is the write index. This index describes the data when a write() or
+writev() is called on the /sys/kernel/debug/tracing/user_events_data file.
+
+The structures referenced in this document are contained within the
+/include/uapi/linux/user_events.h file in the source tree.
+
+**NOTE:** *Both user_events_status and user_events_data are under the tracefs
+filesystem and may be mounted at different paths than above.*
+
+Registering
+-----------
+Registering within a user process is done via ioctl() out to the
+/sys/kernel/debug/tracing/user_events_data file. The command to issue is
+DIAG_IOCSREG. This command takes a struct user_reg as an argument.
+
+The struct user_reg requires two values, the first is the size of the structure
+to ensure forward and backward compatibility. The second is the command string
+to issue for registering.
+
+User based events show up under tracefs like any other event under the
+subsystem named "user_events". This means tools that wish to attach to the
+events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
+or perf record -e user_events:[name] when attaching/recording.
+
+**NOTE:** *The write_index returned is only valid for the FD that was used*
+
+Command Format
+^^^^^^^^^^^^^^
+The command string format is as follows::
+
+  name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]
+
+Supported Flags
+^^^^^^^^^^^^^^^
+**BPF_ITER** - EBPF programs attached to this event will get the raw iovec
+struct instead of any data copies for max performance.
+
+Field Format
+^^^^^^^^^^^^
+::
+
+  type name [size]
+
+Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
+User programs are encouraged to use clearly sized types like u32.
+
+**NOTE:** *Long is not supported since size can vary between user and kernel.*
+
+The size is only valid for types that start with a struct prefix.
+This allows user programs to describe custom structs out to tools, if required.
+
+For example, a struct in C that looks like this::
+
+  struct mytype {
+    char data[20];
+  };
+
+Would be represented by the following field::
+
+  struct mytype myname 20
+
+Status
+------
+When tools attach/record user based events the status of the event is updated
+in realtime. This allows user programs to only incur the cost of the write() or
+writev() calls when something is actively attached to the event.
+
+User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to
+check the status for each event that is registered. The byte to check in the
+file is given back after the register ioctl() via user_reg.status_index.
+Currently the size of user_events_status is a single page; however, custom
+kernel configurations can change this size to allow more user based events. In
+all cases the size of the file is a multiple of a page size.
+
+For example, if the register ioctl() gives back a status_index of 3 you would
+check byte 3 of the returned mmap data to see if anything is attached to that
+event.
+
+Administrators can easily check the status of all registered events by reading
+the user_events_status file directly via a terminal. The output is as follows::
+
+  Byte:Name [# Comments]
+  ...
+
+  Active: ActiveCount
+  Busy: BusyCount
+  Max: MaxCount
+
+For example, on a system that has a single event the output looks like this::
+
+  1:test
+
+  Active: 1
+  Busy: 0
+  Max: 4096
+
+If a user enables the user event via ftrace, the output would change to this::
+
+  1:test # Used by ftrace
+
+  Active: 1
+  Busy: 1
+  Max: 4096
+
+**NOTE:** *A status index of 0 will never be returned. This allows user
+programs to have an index that can be used on error cases.*
+
+Status Bits
+^^^^^^^^^^^
+The byte being checked will be non-zero if anything is attached. Programs can
+check specific bits in the byte to see what mechanism has been attached.
+
+The following values are defined to aid in checking what has been attached:
+
+**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0).
+
+**EVENT_STATUS_PERF** - Bit set if perf/eBPF has been attached (Bit 1).
+
+Writing Data
+------------
+After registering an event the same fd that was used to register can be used
+to write an entry for that event. The write_index returned must be at the start
+of the data, then the remaining data is treated as the payload of the event.
+
+For example, if the write_index returned was 1 and I wanted to write out an int
+payload for the event, then the data would have to be 8 bytes (2 ints) in size,
+with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
+value I want as the payload.
+
+In memory this would look like this::
+
+  int index;
+  int payload;
+
+User programs might have well known structs that they wish to use to emit out
+as payloads. In those cases writev() can be used, with the first vector being
+the index and the following vector(s) being the actual event payload.
+
+For example, if I have a struct like this::
+
+  struct payload {
+        int src;
+        int dst;
+        int flags;
+  };
+
+It's advised for user programs to do the following::
+
+  struct iovec io[2];
+  struct payload e;
+
+  io[0].iov_base = &write_index;
+  io[0].iov_len = sizeof(write_index);
+  io[1].iov_base = &e;
+  io[1].iov_len = sizeof(e);
+
+  writev(fd, (const struct iovec*)io, 2);
+
+**NOTE:** *The write_index is not emitted out into the trace being recorded.*
+
+EBPF
+----
+EBPF programs that attach to a user-based event tracepoint are given a pointer
+to a struct user_bpf_context. The bpf context contains the data type (which can
+be a user or kernel buffer, or can be a pointer to the iovec) and the data
+length that was emitted (minus the write_index).
+
+Example Code
+------------
+See sample code in samples/user_events.
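
The status check and the write layout described in the Status and Writing Data sections of the patch can be sketched in C roughly as follows. This is an illustrative sketch only, not code from the series: the helper names event_enabled() and pack_int_event() are hypothetical, and the values of EVENT_STATUS_FTRACE and EVENT_STATUS_PERF are assumed from the "Bit 0" / "Bit 1" wording in the document rather than copied from the uapi header. No tracefs file is opened here; a real program would get status_index and write_index back from the register ioctl() and the status page from mmap().

```c
#include <stddef.h>
#include <string.h>

/* Assumed from the document's "Bit 0" / "Bit 1" wording; a real program
 * should use the definitions from the uapi header instead. */
#define EVENT_STATUS_FTRACE (1 << 0)
#define EVENT_STATUS_PERF   (1 << 1)

/* Hypothetical helper: returns non-zero if anything is attached to the
 * event whose status byte sits at status_index in the mmap()'d page. */
static int event_enabled(const unsigned char *status_page, int status_index)
{
	return status_page[status_index] != 0;
}

/* Hypothetical helper: lays out write_index followed by a single int
 * payload, i.e. the two-int buffer the Writing Data section describes.
 * Returns the number of bytes to pass to write(), or 0 if buf is too
 * small to hold both ints. */
static size_t pack_int_event(void *buf, size_t buf_len,
			     int write_index, int payload)
{
	if (buf_len < 2 * sizeof(int))
		return 0;

	memcpy(buf, &write_index, sizeof(int));
	memcpy((char *)buf + sizeof(int), &payload, sizeof(int));
	return 2 * sizeof(int);
}
```

A caller would typically test event_enabled() first and only then pass the packed buffer to write() on the same fd that was used to register, so the cost of the write is paid only when a tool is attached.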