[RFC,00/11] Tracefs support for pKVM

Message ID 20240805173234.3542917-1-vdonnefort@google.com (mailing list archive)

Message

Vincent Donnefort Aug. 5, 2024, 5:32 p.m. UTC
The growing set of features supported by the hypervisor in protected
mode necessitates debugging and profiling tools. Tracefs is the
ideal candidate for this task:

  * It is simple to use and to script.

  * It is supported by various tools, from the trace-cmd CLI to the
    Android web-based perfetto.

  * The ring-buffer, where trace events are stored, consists of linked
    pages, making it an ideal structure for sharing between kernel and
    hypervisor.

This series introduces a method to create events and to generate them
from the hypervisor (hyp_enter/hyp_exit given as an example) as well as
a Tracefs user-space interface to read them.

A presentation was given on this matter during the tracing summit in
2022. [1]

1. ring-buffer
--------------

To set up the per-cpu ring-buffers, a new interface is created:

  ring_buffer_writer:	Describes what the kernel needs to know about the
			writer, that is, the set of pages forming the
			ring-buffer and a callback for the reader/head
			swapping (enables consuming read)

  ring_buffer_reader():	Creates a read-only ring-buffer from a
			ring_buffer_writer.

To keep the internals of `struct ring_buffer` in sync with the writer,
the meta-page is used. It was originally introduced to enable user-space
mapping of the ring-buffer [2]. In this case, the kernel is not the
producer anymore but the reader. The function to read that meta-page is:

  ring_buffer_poll_writer():
			Update `struct ring_buffer` based on the writer
			meta-page. Wake-up readers if necessary.

The kernel has to poll the meta-page to be notified of newly written
events.
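
As a rough illustration of the polling idea only (not the code from this
series), the reader can keep the last counters it saw and compare them
with the writer-owned meta-page on each poll. Every name below
(demo_meta_page, demo_reader, demo_poll_writer) is hypothetical, and the
meta-page is reduced to two counters for the sake of the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the writer-owned meta-page. */
struct demo_meta_page {
	uint64_t entries;	/* events written so far */
	uint64_t overrun;	/* events lost to overwriting */
};

/* Hypothetical reader-side state, kept in sync by polling. */
struct demo_reader {
	const volatile struct demo_meta_page *meta;
	uint64_t seen_entries;
	uint64_t seen_overrun;
};

/*
 * Refresh the reader's view from the meta-page.
 * Returns true when events were written since the last poll,
 * i.e. when blocked readers should be woken up.
 */
static bool demo_poll_writer(struct demo_reader *r)
{
	uint64_t entries, overrun;
	bool wake;

	assert(r && r->meta);

	/* Read each counter once: the writer may update them at any time. */
	entries = r->meta->entries;
	overrun = r->meta->overrun;

	wake = entries != r->seen_entries;
	r->seen_entries = entries;
	r->seen_overrun = overrun;

	return wake;
}
```

A real implementation would also need memory ordering guarantees between
the writer and the reader, which this sketch leaves out.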

2. Tracefs interface
--------------------

The interface is a hyp/ folder at the root of the tracefs mount point.
This folder behaves like an instance and contains a subset of the
regular Tracefs user-space interface:

  hyp/
     buffer_size_kb
     trace_pipe
     trace_pipe_raw
     trace
     per_cpu/
             cpuX/
                 trace_pipe
                 trace_pipe_raw
     events/
            hyp/
                hyp_enter/
                          enable
                          id
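
To make the layout above concrete, here is a small user-space sketch that
builds the path of an event's enable file from the hierarchy shown. The
helper name hyp_event_enable_path() is hypothetical and not part of the
series; only the directory layout comes from the tree above:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build "<tracefs>/hyp/events/hyp/<event>/enable" into buf.
 * Returns the path length on success, -1 if buf is too small.
 */
static int hyp_event_enable_path(char *buf, size_t len,
				 const char *tracefs, const char *event)
{
	int n;

	assert(buf && tracefs && event);

	n = snprintf(buf, len, "%s/hyp/events/hyp/%s/enable", tracefs, event);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

With the usual mount point, passing "/sys/kernel/tracing" and "hyp_enter"
gives the file to write 1 into to enable that event.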

Behind the scenes, kvm/hyp_trace.c must rebuild the tracing hierarchy
without relying on kernel/trace/trace.c. This is due to fundamental
differences:

  * Hypervisor tracing doesn't support trace_array's system-specific
    features (snapshots, tracers, etc.).

  * Logged event formats differ (e.g., no PID in hypervisor
    events).

  * Buffer operations require specific hypervisor interactions.

3. Events
---------

In the hypervisor, "hyp events" can be generated with trace_<event_name>
in a similar fashion to what the kernel does. They're also created with
macros similar to the kernel's (see kvm_hypevents.h):

HYP_EVENT("foboar",
	HE_PROTO(void),
	HE_STRUCT(),
	HE_ASSIGN(),
	HE_PRINTK(" ")
)

Despite the apparent similarities with TRACE_EVENT(), those macros
differ internally: they must be used in parallel by the hypervisor
(for the writing part) and the kernel (for the reading part), which makes
it difficult to share anything with their kernel counterparts.
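
The "used in parallel" idea can be sketched with a plain X-macro: one
event list is expanded once into writer-side trace_<name>() functions and
once into a reader-side name table, both keyed by the same ids. This is
only an illustration of the general technique; DEMO_EVENTS, demo_log and
the other demo_ names are hypothetical and do not reflect how
HYP_EVENT() is implemented in the series:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical event list: one line per event, expanded several times. */
#define DEMO_EVENTS(X)		\
	X(hyp_enter)		\
	X(hyp_exit)

/* Shared expansion: event ids known to both writer and reader. */
#define X_ID(name) DEMO_ID_##name,
enum demo_event_id { DEMO_EVENTS(X_ID) DEMO_NR_EVENTS };
#undef X_ID

/* Stand-in for the shared buffer: the writer only stores event ids. */
static unsigned short demo_log[8];
static unsigned int demo_log_len;

/* Writer-side expansion: one trace_<name>() function per event. */
#define X_WRITER(name)						\
	static void trace_##name(void)				\
	{							\
		assert(demo_log_len < 8);			\
		demo_log[demo_log_len++] = DEMO_ID_##name;	\
	}
DEMO_EVENTS(X_WRITER)
#undef X_WRITER

/* Reader-side expansion: id to printable name table. */
#define X_NAME(name) [DEMO_ID_##name] = #name,
static const char *const demo_event_names[DEMO_NR_EVENTS] = {
	DEMO_EVENTS(X_NAME)
};
#undef X_NAME

/* Reader: turn the logged ids back into text. */
static void demo_print_log(FILE *out)
{
	for (unsigned int i = 0; i < demo_log_len; i++)
		fprintf(out, "%s\n", demo_event_names[demo_log[i]]);
}
```

Because both sides are generated from the same list, adding an event
cannot leave the writer and the reader with mismatched ids.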

Also, the events directory doesn't use eventfs.

4. A few limitations
--------------------

Non-consuming reads of the buffer (e.g. cat trace) aren't supported due
to the lack of support in the ring-buffer meta-page.

Tracing must be stopped before the buffer can be reset, i.e. (echo 0 >
tracing_on; echo 0 > trace).

[1] https://tracingsummit.org/ts/2022/hypervisortracing/
[2] https://lore.kernel.org/all/20240510140435.3550353-1-vdonnefort@google.com/

Vincent Donnefort (11):
  ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
  ring-buffer: Introducing ring-buffer writer
  ring-buffer: Expose buffer_data_page material
  timekeeping: Export the boot clock in snapshots
  KVM: arm64: Support unaligned fixmap in the nVHE hyp
  KVM: arm64: Add clock support in the nVHE hyp
  KVM: arm64: Add tracing support for the pKVM hyp
  KVM: arm64: Add hyp tracing to tracefs
  KVM: arm64: Add raw interface for hyp tracefs
  KVM: arm64: Add support for hyp events
  KVM: arm64: Add kselftest for tracefs hyp tracefs

 arch/arm64/include/asm/kvm_asm.h              |   6 +
 arch/arm64/include/asm/kvm_define_hypevents.h |  60 ++
 arch/arm64/include/asm/kvm_hyp.h              |   6 +
 arch/arm64/include/asm/kvm_hypevents.h        |  41 +
 arch/arm64/include/asm/kvm_hypevents_defs.h   |  41 +
 arch/arm64/include/asm/kvm_hyptrace.h         |  38 +
 arch/arm64/kernel/image-vars.h                |   4 +
 arch/arm64/kernel/vmlinux.lds.S               |  18 +
 arch/arm64/kvm/Kconfig                        |   9 +
 arch/arm64/kvm/Makefile                       |   2 +
 arch/arm64/kvm/arm.c                          |   6 +
 arch/arm64/kvm/hyp/hyp-constants.c            |   4 +
 arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h   |  13 +
 arch/arm64/kvm/hyp/include/nvhe/clock.h       |  15 +
 .../kvm/hyp/include/nvhe/define_events.h      |  21 +
 arch/arm64/kvm/hyp/include/nvhe/trace.h       |  55 ++
 arch/arm64/kvm/hyp/nvhe/Makefile              |   1 +
 arch/arm64/kvm/hyp/nvhe/clock.c               |  42 +
 arch/arm64/kvm/hyp/nvhe/events.c              |  35 +
 arch/arm64/kvm/hyp/nvhe/ffa.c                 |   2 +-
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  64 ++
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S             |   4 +
 arch/arm64/kvm/hyp/nvhe/mm.c                  |   2 +-
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |  14 +-
 arch/arm64/kvm/hyp/nvhe/switch.c              |   5 +-
 arch/arm64/kvm/hyp/nvhe/trace.c               | 594 ++++++++++++
 arch/arm64/kvm/hyp_events.c                   | 164 ++++
 arch/arm64/kvm/hyp_trace.c                    | 854 ++++++++++++++++++
 arch/arm64/kvm/hyp_trace.h                    |  15 +
 include/linux/ring_buffer.h                   | 124 ++-
 include/linux/timekeeping.h                   |   6 +
 kernel/time/timekeeping.c                     |   9 +
 kernel/trace/ring_buffer.c                    | 244 +++--
 tools/testing/selftests/hyp-trace/Makefile    |   6 +
 tools/testing/selftests/hyp-trace/config      |   4 +
 .../selftests/hyp-trace/hyp-trace-test        | 161 ++++
 36 files changed, 2591 insertions(+), 98 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_define_hypevents.h
 create mode 100644 arch/arm64/include/asm/kvm_hypevents.h
 create mode 100644 arch/arm64/include/asm/kvm_hypevents_defs.h
 create mode 100644 arch/arm64/include/asm/kvm_hyptrace.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/clock.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/define_events.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/trace.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/clock.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/events.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/trace.c
 create mode 100644 arch/arm64/kvm/hyp_events.c
 create mode 100644 arch/arm64/kvm/hyp_trace.c
 create mode 100644 arch/arm64/kvm/hyp_trace.h
 create mode 100644 tools/testing/selftests/hyp-trace/Makefile
 create mode 100644 tools/testing/selftests/hyp-trace/config
 create mode 100644 tools/testing/selftests/hyp-trace/hyp-trace-test


base-commit: e4fc196f5ba36eb7b9758cf2c73df49a44199895

Comments

Steven Rostedt Aug. 6, 2024, 8:11 p.m. UTC | #1
Hi Vincent,

Thanks for sending this!

On Mon,  5 Aug 2024 18:32:23 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:

> The growing set of features supported by the hypervisor in protected
> mode necessitates debugging and profiling tools. Tracefs is the
> ideal candidate for this task:
> 
>   * It is simple to use and to script.
> 
>   * It is supported by various tools, from the trace-cmd CLI to the
>     Android web-based perfetto.
> 
>   * The ring-buffer, where are stored trace events consists of linked
>     pages, making it an ideal structure for sharing between kernel and
>     hypervisor.
> 
> This series introduces a method to create events and to generate them
> from the hypervisor (hyp_enter/hyp_exit given as an example) as well as
> a Tracefs user-space interface to read them.
> 
> A presentation was given on this matter during the tracing summit in
> 2022. [1]
> 
> 1. ring-buffer
> --------------
> 
> To setup the per-cpu ring-buffers, a new interface is created:
> 
>   ring_buffer_writer:	Describes what the kernel needs to know about the
> 			writer, that is, the set of pages forming the
> 			ring-buffer and a callback for the reader/head
> 			swapping (enables consuming read)
> 
>   ring_buffer_reader():	Creates a read-only ring-buffer from a
> 			ring_buffer_writer.
> 
> To keep the internals of `struct ring_buffer` in sync with the writer,
> the meta-page is used. It was originally introduced to enable user-space
> mapping of the ring-buffer [1]. In this case, the kernel is not the
> producer anymore but the reader. The function to read that meta-page is:
> 
>   ring_buffer_poll_writer():
> 			Update `struct ring_buffer` based on the writer
> 			meta-page. Wake-up readers if necessary.
> 
> The kernel has to poll the meta-page to be notified of newly written
> events.
> 
> 2. Tracefs interface
> --------------------
> 
> The interface is a hyp/ folder at the root of the tracefs mount point.
> This folder is like an instance and you'll find there a subset of the
> regular Tracefs user-space interface:
> 
>   hyp/

Hmm, do we really need to shorten it? Why not just call it "hypervisor"? I
mean, tab completion helps with the typing.

>      buffer_size_kb
>      trace_pipe
>      trace_pipe_raw
>      trace
>      per_cpu/
>              cpuX/
>                  trace_pipe
>                  trace_pipe_raw
>      events/
>             hyp/
>                 hyp_enter/
>                           enable
>                           id
> 
> Behind the scenes, kvm/hyp_trace.c must rebuild the tracing hierarchy
> without relying on kernel/trace/trace.c. This is due to fundamental
> differences:
> 
>   * Hypervisor tracing doesn't support trace_array's system-specific
>     features (snapshots, tracers, etc.).
> 
>   * Logged event formats differ (e.g., no PID in hypervisor
>     events).
> 
>   * Buffer operations require specific hypervisor interactions.
> 
> 3. Events
> ---------
> 
> In the hypervisor, "hyp events" can be generated with trace_<event_name>
> in a similar fashion to what the kernel does. They're also created with
> similar macros than the kernel (see kvm_hypevents.h)
> 
> HYP_EVENT("foboar",
> 	HE_PROTO(void),
> 	HE_STRUCT(),
> 	HE_ASSIGN(),
> 	HE_PRINTK(" ")
> )
> 
> Despite the apparent similarities with TRACE_EVENT(), those macros
> internally differs: they must be used in parallel between the hypervisor
> (for the writing part) and the kernel (for the reading part) which makes
> it difficult to share anything with their kernel counterpart.
> 
> Also, events directory isn't using eventfs.
> 
> 4. Few limitations:
> -------------------
> 
> Non consuming reading of the buffer isn't supported (i.e. cat trace) due
> to the lack of support in the ring-buffer meta-page.

Hmm, I don't think it should be hard to support that. I've been looking
into it for the user mapping. But that can be added later. For now, perhaps
"cat trace" just returns -EPERM?

> 
> Tracing must be stopped for the buffer to be reset. i.e. (echo 0 >
> tracing_on; echo 0 > trace)

Hmm, why this?  I haven't looked at the patches yet, but why can't the
write to trace just stop tracing and re-enable it after the reset?

> 

-- Steve
Vincent Donnefort Aug. 7, 2024, 4:39 p.m. UTC | #2
On Tue, Aug 06, 2024 at 04:11:38PM -0400, Steven Rostedt wrote:
> 
> Hi Vincent,
> 
> Thanks for sending this!

And thanks for already having a look at it!

[..]

> > 2. Tracefs interface
> > --------------------
> > 
> > The interface is a hyp/ folder at the root of the tracefs mount point.
> > This folder is like an instance and you'll find there a subset of the
> > regular Tracefs user-space interface:
> > 
> >   hyp/
> 
> Hmm, do we really need to shorten it? Why not just call it "hypervisor". I
> mean tab completion helps with the typing.

In most of the code we do refer to hyp, that's why we kept the naming here too.
But yeah we could expand it.

> 
> >      buffer_size_kb
> >      trace_pipe
> >      trace_pipe_raw
> >      trace
> >      per_cpu/
> >              cpuX/
> >                  trace_pipe
> >                  trace_pipe_raw
> >      events/
> >             hyp/
> >                 hyp_enter/
> >                           enable
> >                           id
> > 

[...]

> > 
> > 4. Few limitations:
> > -------------------
> > 
> > Non consuming reading of the buffer isn't supported (i.e. cat trace) due
> > to the lack of support in the ring-buffer meta-page.
> 
> Hmm, I don't think it should be hard to support that. I've been looking
> into it for the user mapping. But that can be added later. For now, perhaps
> "cat trace" just returns -EPERM?

Yeah, I am sure that's something we can make work. But definitely not a priority
as it is less reliable than _pipe and unused by user-space tools I believe.

For now we print "Not supported yet". But happy to change it to return
-EPERM.

> 
> > 
> > Tracing must be stopped for the buffer to be reset. i.e. (echo 0 >
> > tracing_on; echo 0 > trace)
> 
> Hmm, why this?  I haven't looked at the patches yet, but why can't the
> write to trace just stop tracing and re-enable it after the reset?

I could reset the buffers from the hypervisor with a dedicated hypercall.

However I'd still need a way to "teardown" the buffer, that is unsharing it with
the hypervisor and freeing the allocated memory. Using that reset for this
purpose was nice even though it implied stopping tracing in the first place.

Perhaps `echo 0 > trace` could reset the buffer if tracing_on=1 and teardown the
buffers if tracing_on=0?

Alternatively, I could use `echo 0 > buffer_size_kb` for the teardown. But I
prefer the former solution: interface users are more likely to just write 0 to
tracing_on and trace.

> 
> > 
> 
> -- Steve