[RFC,2/2] KVM: arm64: export cntvoff in debugfs

Message ID	20211119102117.22304-3-nsaenzju@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=J+MM=QG=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 54DEE61B6F From: Nicolas Saenz Julienne <nsaenzju@redhat.com> To: linux-arm-kernel@lists.infradead.org, maz@kernel.org, rostedt@goodmis.org Cc: james.morse@arm.com, alexandru.elisei@arm.com, suzuki.poulose@arm.com, catalin.marinas@arm.com, will@kernel.org, nsaenzju@redhat.com, linux-kernel@vger.kernel.org, kvmarm@lists.cs.columbia.edu, mingo@redhat.com, mtosatti@redhat.com, nilal@redhat.com Subject: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs Date: Fri, 19 Nov 2021 11:21:18 +0100 Message-Id: <20211119102117.22304-3-nsaenzju@redhat.com> In-Reply-To: <20211119102117.22304-1-nsaenzju@redhat.com> References: <20211119102117.22304-1-nsaenzju@redhat.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	KVM: arm64: Host/Guest trace syncronization \| expand [RFC,0/2] KVM: arm64: Host/Guest trace syncronization [RFC,1/2] arm64/tracing: add cntvct based trace clock [RFC,2/2] KVM: arm64: export cntvoff in debugfs

Nicolas Saenz Julienne Nov. 19, 2021, 10:21 a.m. UTC

While using cntvct as the raw clock for tracing, it's possible to
synchronize host/guest traces just by knowing the virtual offset applied
to the guest's virtual counter.

This is also the case on x86 when TSC is available. The offset is
exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
implement the same for arm64.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 arch/arm64/include/asm/kvm_host.h |  1 +
 arch/arm64/kvm/Makefile           |  2 +-
 arch/arm64/kvm/arch_timer.c       |  2 +-
 arch/arm64/kvm/debugfs.c          | 25 +++++++++++++++++++++++++
 include/kvm/arm_arch_timer.h      |  3 +++
 5 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/kvm/debugfs.c

Marcelo Tosatti Nov. 19, 2021, 11:11 a.m. UTC | #1

On Fri, Nov 19, 2021 at 11:21:18AM +0100, Nicolas Saenz Julienne wrote:
> While using cntvct as the raw clock for tracing, it's possible to
> synchronize host/guest traces just by knowing the virtual offset applied
> to the guest's virtual counter.
> 
> This is also the case on x86 when TSC is available. The offset is
> exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> implement the same for arm64.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Hi Nicolas,

ARM:

CNTVCTSS_EL0, Counter-timer Self-Synchronized Virtual Count register
The CNTVCTSS_EL0 characteristics are:

Purpose
Holds the 64-bit virtual count value. The virtual count value is equal to the 
physical count value visible in CNTPCT_EL0 minus the virtual offset visible in CNTVOFF_EL2.
					   ^^^^^

x86:

24.6.5 Time-Stamp Counter Offset and Multiplier
The VM-execution control fields include a 64-bit TSC-offset field. If the “RDTSC exiting” control is 0 and the “use
TSC offsetting” control is 1, this field controls executions of the RDTSC and RDTSCP instructions. It also controls
executions of the RDMSR instruction that read from the IA32_TIME_STAMP_COUNTER MSR. For all of these, the
value of the TSC offset is added to the value of the time-stamp counter, and the sum is returned to guest software
			   ^^^^^
in EDX:EAX.

So it would be nice to keep the formula consistent for userspace:

GUEST_CLOCK_VAL = HOST_CLOCK_VAL + CLOCK_OFFSET

So would have to add a negative sign to the value to userspace.

Other than that, both the clock value (VCNTPCT_EL0) and the offset
(CNTVOFF_EL2) are not modified during guest execution? That is, CNTVOFF_EL2 is
written once during guest initialization.


> ---
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/kvm/Makefile           |  2 +-
>  arch/arm64/kvm/arch_timer.c       |  2 +-
>  arch/arm64/kvm/debugfs.c          | 25 +++++++++++++++++++++++++
>  include/kvm/arm_arch_timer.h      |  3 +++
>  5 files changed, 31 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/kvm/debugfs.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2a5f7f38006f..130534c9079e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -29,6 +29,7 @@
>  #include <asm/thread_info.h>
>  
>  #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>  
>  #define KVM_HALT_POLL_NS_DEFAULT 500000
>  
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..17be7cf770f2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>  	 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
>  	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
>  	 inject_fault.o va_layout.o handle_exit.o \
> -	 guest.o debug.o reset.o sys_regs.o \
> +	 guest.o debug.o debugfs.o reset.o sys_regs.o \
>  	 vgic-sys-reg-v3.o fpsimd.o pmu.o \
>  	 arch_timer.o trng.o\
>  	 vgic/vgic.o vgic/vgic-init.o \
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..ee69387f7fb6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -82,7 +82,7 @@ u64 timer_get_cval(struct arch_timer_context *ctxt)
>  	}
>  }
>  
> -static u64 timer_get_offset(struct arch_timer_context *ctxt)
> +u64 timer_get_offset(struct arch_timer_context *ctxt)
>  {
>  	struct kvm_vcpu *vcpu = ctxt->vcpu;
>  
> diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> new file mode 100644
> index 000000000000..f0f5083ea8d4
> --- /dev/null
> +++ b/arch/arm64/kvm/debugfs.c
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021 Red Hat Inc.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/debugfs.h>
> +
> +#include <kvm/arm_arch_timer.h>
> +
> +static int vcpu_get_cntv_offset(void *data, u64 *val)
> +{
> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +
> +	*val = timer_get_offset(vcpu_vtimer(vcpu));
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> +
> +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> +{
> +	debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> +}
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index 51c19381108c..de0cd9be825c 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -106,4 +106,7 @@ void kvm_arm_timer_write_sysreg(struct kvm_vcpu *vcpu,
>  u32 timer_get_ctl(struct arch_timer_context *ctxt);
>  u64 timer_get_cval(struct arch_timer_context *ctxt);
>  
> +/* Nedded for debugfs */
> +u64 timer_get_offset(struct arch_timer_context *ctxt);
> +
>  #endif
> -- 
> 2.33.1
> 
>

Marc Zyngier Nov. 19, 2021, 12:17 p.m. UTC | #2

On Fri, 19 Nov 2021 10:21:18 +0000,
Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> 
> While using cntvct as the raw clock for tracing, it's possible to
> synchronize host/guest traces just by knowing the virtual offset applied
> to the guest's virtual counter.
> 
> This is also the case on x86 when TSC is available. The offset is
> exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> implement the same for arm64.

How does this work with NV, where the guest hypervisor is in control
of the virtual offset? How does userspace knows which vcpu to pick so
that it gets the right offset?

I also wonder why we need this when userspace already has direct
access to that information without any extra kernel support (read the
CNTVCT view of the vcpu using the ONEREG API, subtract it from the
host view of the counter, job done).

> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/kvm/Makefile           |  2 +-
>  arch/arm64/kvm/arch_timer.c       |  2 +-
>  arch/arm64/kvm/debugfs.c          | 25 +++++++++++++++++++++++++
>  include/kvm/arm_arch_timer.h      |  3 +++
>  5 files changed, 31 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/kvm/debugfs.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2a5f7f38006f..130534c9079e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -29,6 +29,7 @@
>  #include <asm/thread_info.h>
>  
>  #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>  
>  #define KVM_HALT_POLL_NS_DEFAULT 500000
>  
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..17be7cf770f2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>  	 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
>  	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
>  	 inject_fault.o va_layout.o handle_exit.o \
> -	 guest.o debug.o reset.o sys_regs.o \
> +	 guest.o debug.o debugfs.o reset.o sys_regs.o \
>  	 vgic-sys-reg-v3.o fpsimd.o pmu.o \
>  	 arch_timer.o trng.o\
>  	 vgic/vgic.o vgic/vgic-init.o \
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..ee69387f7fb6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -82,7 +82,7 @@ u64 timer_get_cval(struct arch_timer_context *ctxt)
>  	}
>  }
>  
> -static u64 timer_get_offset(struct arch_timer_context *ctxt)
> +u64 timer_get_offset(struct arch_timer_context *ctxt)
>  {
>  	struct kvm_vcpu *vcpu = ctxt->vcpu;
>  
> diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> new file mode 100644
> index 000000000000..f0f5083ea8d4
> --- /dev/null
> +++ b/arch/arm64/kvm/debugfs.c
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021 Red Hat Inc.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/debugfs.h>
> +
> +#include <kvm/arm_arch_timer.h>
> +
> +static int vcpu_get_cntv_offset(void *data, u64 *val)
> +{
> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +
> +	*val = timer_get_offset(vcpu_vtimer(vcpu));
> +
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> +
> +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> +{
> +	debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> +}

This should be left in arch_timer.c until we actually need it for
multiple subsystems. When (and if) that happens, we will expose
per-subsystem debugfs initialisers instead of exposing the guts of the
timer code.

Thanks,

	M.

Marcelo Tosatti Nov. 19, 2021, 12:59 p.m. UTC | #3

On Fri, Nov 19, 2021 at 12:17:00PM +0000, Marc Zyngier wrote:
> On Fri, 19 Nov 2021 10:21:18 +0000,
> Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> > 
> > While using cntvct as the raw clock for tracing, it's possible to
> > synchronize host/guest traces just by knowing the virtual offset applied
> > to the guest's virtual counter.
> > 
> > This is also the case on x86 when TSC is available. The offset is
> > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > implement the same for arm64.
> 
> How does this work with NV, where the guest hypervisor is in control
> of the virtual offset? How does userspace knows which vcpu to pick so
> that it gets the right offset?

On x86, the offsets for different vcpus are the same due to the logic at
kvm_synchronize_tsc function:

During guest vcpu creation, when the TSC-clock values are written
in a short window of time (or the clock value is zero), the code uses
the same TSC.

This logic is problematic (since "short window of time" is a heuristic 
which can fail), and is being replaced by writing the same offset
for each vCPU:

commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton <oupton@google.com>
Date:   Thu Sep 16 18:15:38 2021 +0000

    KVM: x86: Expose TSC offset controls to userspace
    
    To date, VMM-directed TSC synchronization and migration has been a bit
    messy. KVM has some baked-in heuristics around TSC writes to infer if
    the VMM is attempting to synchronize. This is problematic, as it depends
    on host userspace writing to the guest's TSC within 1 second of the last
    write.
    
    A much cleaner approach to configuring the guest's views of the TSC is to
    simply migrate the TSC offset for every vCPU. Offsets are idempotent,
    and thus not subject to change depending on when the VMM actually
    reads/writes values from/to KVM. The VMM can then read the TSC once with
    KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
    the guest is paused.

So with that in place, the answer to 

How does userspace knows which vcpu to pick so
that it gets the right offset?

is any vcpu, since the offsets are the same.

> I also wonder why we need this when userspace already has direct
> access to that information without any extra kernel support (read the
> CNTVCT view of the vcpu using the ONEREG API, subtract it from the
> host view of the counter, job done).

If guest has access to the clock offset (between guest and host), then
in the guest:

	clockval = hostclockval - clockoffset

Adding "clockoffset" to that will retrieve the host clock.

Is that what you mean?

Marc Zyngier Nov. 19, 2021, 1:31 p.m. UTC | #4

On Fri, 19 Nov 2021 12:59:46 +0000,
Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> On Fri, Nov 19, 2021 at 12:17:00PM +0000, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +0000,
> > Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> > > 
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > > 
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> > 
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset? How does userspace knows which vcpu to pick so
> > that it gets the right offset?
> 
> On x86, the offsets for different vcpus are the same due to the logic at
> kvm_synchronize_tsc function:
> 
> During guest vcpu creation, when the TSC-clock values are written
> in a short window of time (or the clock value is zero), the code uses
> the same TSC.
> 
> This logic is problematic (since "short window of time" is a heuristic 
> which can fail), and is being replaced by writing the same offset
> for each vCPU:
> 
> commit 828ca89628bfcb1b8f27535025f69dd00eb55207
> Author: Oliver Upton <oupton@google.com>
> Date:   Thu Sep 16 18:15:38 2021 +0000
> 
>     KVM: x86: Expose TSC offset controls to userspace
>     
>     To date, VMM-directed TSC synchronization and migration has been a bit
>     messy. KVM has some baked-in heuristics around TSC writes to infer if
>     the VMM is attempting to synchronize. This is problematic, as it depends
>     on host userspace writing to the guest's TSC within 1 second of the last
>     write.
>     
>     A much cleaner approach to configuring the guest's views of the TSC is to
>     simply migrate the TSC offset for every vCPU. Offsets are idempotent,
>     and thus not subject to change depending on when the VMM actually
>     reads/writes values from/to KVM. The VMM can then read the TSC once with
>     KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
>     the guest is paused.
> 
> So with that in place, the answer to 
> 
> How does userspace knows which vcpu to pick so
> that it gets the right offset?
> 
> is any vcpu, since the offsets are the same.

As I just said above, this assertion doesn't hold true once you have
nested virt, because the offset is per-cpu, and is adjusted to mean
different things on different hypervisors (some hypervisors expose
stolen time through it, for example).

What this patch is doing is to expose a Linux-specific behaviour, and
try to derive properties from it. It really doesn't work in general.

> 
> > I also wonder why we need this when userspace already has direct
> > access to that information without any extra kernel support (read the
> > CNTVCT view of the vcpu using the ONEREG API, subtract it from the
> > host view of the counter, job done).
> 
> If guest has access to the clock offset (between guest and host), then
> in the guest:
> 
> 	clockval = hostclockval - clockoffset
> 
> Adding "clockoffset" to that will retrieve the host clock.
> 
> Is that what you mean?

No. The *VMM* (qemu, kvmtool, crosvm, insertyourfavouriteonehere) has
already access to it. Why do we need an extra interface?

	M.

Nicolas Saenz Julienne Nov. 22, 2021, 8:40 p.m. UTC | #5

Hi Marc, thanks for the review.

On Fri, 2021-11-19 at 12:17 +0000, Marc Zyngier wrote:
> On Fri, 19 Nov 2021 10:21:18 +0000,
> Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> > 
> > While using cntvct as the raw clock for tracing, it's possible to
> > synchronize host/guest traces just by knowing the virtual offset applied
> > to the guest's virtual counter.
> > 
> > This is also the case on x86 when TSC is available. The offset is
> > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > implement the same for arm64.
> 
> How does this work with NV, where the guest hypervisor is in control
> of the virtual offset? 

TBH I handn't thought about NV. Looking at it from that angle, I now see my
approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
entering into a guest, we change CNTVOFF before the host is done with tracing,
so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
the hosts I was testing with use CNTPCT.

I believe the solution would be to be able to force a 0 offset between
guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
imposes a non-zero one by default? I checked out the commits that introduced
that code, but couldn't find a compelling reason. VMMs can always change it
through KVM_REG_ARM_TIMER_CNT afterwards.

> I also wonder why we need this when userspace already has direct access to
> that information without any extra kernel support (read the CNTVCT view of
> the vcpu using the ONEREG API, subtract it from the host view of the counter,
> job done).

Well IIUC, you're at the mercy of how long it takes to return from the ONEREG
ioctl. The results will be skewed. For some workloads, where low latency is
key, we really need high precision traces in the order of single digit us or
even 100s of ns. I'm not sure you'll be able to get there with that approach.

[...]

> > diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> > new file mode 100644
> > index 000000000000..f0f5083ea8d4
> > --- /dev/null
> > +++ b/arch/arm64/kvm/debugfs.c
> > @@ -0,0 +1,25 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2021 Red Hat Inc.
> > + */
> > +
> > +#include <linux/kvm_host.h>
> > +#include <linux/debugfs.h>
> > +
> > +#include <kvm/arm_arch_timer.h>
> > +
> > +static int vcpu_get_cntv_offset(void *data, u64 *val)
> > +{
> > +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> > +
> > +	*val = timer_get_offset(vcpu_vtimer(vcpu));
> > +
> > +	return 0;
> > +}
> > +
> > +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> > +
> > +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> > +{
> > +	debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> > +}
> 
> This should be left in arch_timer.c until we actually need it for
> multiple subsystems. When (and if) that happens, we will expose
> per-subsystem debugfs initialisers instead of exposing the guts of the
> timer code.

Noted.

Marc Zyngier Nov. 23, 2021, 11:09 a.m. UTC | #6

On Mon, 22 Nov 2021 20:40:52 +0000,
Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> 
> Hi Marc, thanks for the review.
> 
> On Fri, 2021-11-19 at 12:17 +0000, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +0000,
> > Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> > > 
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > > 
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> > 
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset? 
> 
> TBH I handn't thought about NV. Looking at it from that angle, I now see my
> approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
> entering into a guest, we change CNTVOFF before the host is done with tracing,
> so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
> the hosts I was testing with use CNTPCT.

There are multiple things at play here:

- if the system is a host, the kernel will use CNTPCT. Userspace will
  still use CNTVCT, and the offset is guaranteed to be 0 *when running
  userspace*.

- if the system isn't a host (which doesn't necessarily means a
  guest), CNTVCT is the only thing that is being used, and the offset
  is unknown (Linux requires it to be constant across vcpus though).

So I doubt you'd get a bad timestamp on the host. It is just that you
have named your trace clock incorrectly (and Steven's idea of an
indirected clock could help here).

> I believe the solution would be to be able to force a 0 offset between
> guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
> imposes a non-zero one by default? I checked out the commits that introduced
> that code, but couldn't find a compelling reason. VMMs can always change it
> through KVM_REG_ARM_TIMER_CNT afterwards.

We want to minimise the chance for an observable rollover of the
virtual counter, so time starts at 0 *in the guest*. The VMM can
change the view of that time for the purpose of migration.

If you want a 0 offset, set the counter to the physical value in the
VMM (imprecise) or have a look at Oliver Upton's patches that were
allowing an offset to be specified directly. But migration, by
definition, breaks this.

> 
> > I also wonder why we need this when userspace already has direct access to
> > that information without any extra kernel support (read the CNTVCT view of
> > the vcpu using the ONEREG API, subtract it from the host view of the counter,
> > job done).
> 
> Well IIUC, you're at the mercy of how long it takes to return from the ONEREG
> ioctl. The results will be skewed. For some workloads, where low latency is
> key, we really need high precision traces in the order of single digit us or
> even 100s of ns. I'm not sure you'll be able to get there with that approach.

The PTP clock does exactly that from the guest PoV, with a lot more
overhead, and this results in single digit ns precision. Why isn't
that possible from userspace?

Thanks,

	M.

Marcelo Tosatti Nov. 29, 2021, 12:47 p.m. UTC | #7

On Mon, Nov 22, 2021 at 09:40:52PM +0100, Nicolas Saenz Julienne wrote:
> Hi Marc, thanks for the review.
> 
> On Fri, 2021-11-19 at 12:17 +0000, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +0000,
> > Nicolas Saenz Julienne <nsaenzju@redhat.com> wrote:
> > > 
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > > 
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> > 
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset? 
> 
> TBH I handn't thought about NV. Looking at it from that angle, I now see my
> approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
> entering into a guest, we change CNTVOFF before the host is done with tracing,
> so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
> the hosts I was testing with use CNTPCT.
> 
> I believe the solution would be to be able to force a 0 offset between
> guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
> imposes a non-zero one by default? I checked out the commits that introduced
> that code, but couldn't find a compelling reason. VMMs can always change it
> through KVM_REG_ARM_TIMER_CNT afterwards.

One reason is that you leak information from host to guest (the hosts
TSC value).

Another reason would be that you introduce a configuration which is 
different from the what the hardware has, which can in theory trigger
guest bugs.

> > I also wonder why we need this when userspace already has direct access to
> > that information without any extra kernel support (read the CNTVCT view of
> > the vcpu using the ONEREG API, subtract it from the host view of the counter,
> > job done).
> 
> Well IIUC, you're at the mercy of how long it takes to return from the ONEREG
> ioctl. The results will be skewed. For some workloads, where low latency is
> key, we really need high precision traces in the order of single digit us or
> even 100s of ns. I'm not sure you'll be able to get there with that approach.

If the guest can read the host to guest HW clock offset already, it
could directly do the conversion.

> [...]
> 
> > > diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> > > new file mode 100644
> > > index 000000000000..f0f5083ea8d4
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/debugfs.c
> > > @@ -0,0 +1,25 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Copyright (C) 2021 Red Hat Inc.
> > > + */
> > > +
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/debugfs.h>
> > > +
> > > +#include <kvm/arm_arch_timer.h>
> > > +
> > > +static int vcpu_get_cntv_offset(void *data, u64 *val)
> > > +{
> > > +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> > > +
> > > +	*val = timer_get_offset(vcpu_vtimer(vcpu));
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, "%lld\n");
> > > +
> > > +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
> > > +{
> > > +	debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, &vcpu_cntvoff_fops);
> > > +}
> > 
> > This should be left in arch_timer.c until we actually need it for
> > multiple subsystems. When (and if) that happens, we will expose
> > per-subsystem debugfs initialisers instead of exposing the guts of the
> > timer code.
> 
> Noted.
> 
> -- 
> Nicolás Sáenz
> 
>

[RFC,2/2] KVM: arm64: export cntvoff in debugfs

Commit Message

Comments

Patch