mbox series

[00/10] KVM: Add idempotent controls for migrating system counter state

Message ID 20210608214742.1897483-1-oupton@google.com (mailing list archive)
Headers show
Series KVM: Add idempotent controls for migrating system counter state | expand

Message

Oliver Upton June 8, 2021, 9:47 p.m. UTC
KVM's current means of saving/restoring system counters is plagued with
temporal issues. At least on ARM64 and x86, we migrate the guest's
system counter by-value through the respective guest system register
values (cntvct_el0, ia32_tsc). Restoring system counters by-value is
brittle as the state is not idempotent: the host system counter is still
oscillating between the attempted save and restore. Furthermore, VMMs
may wish to transparently live migrate guest VMs, meaning that they
include the elapsed time due to live migration blackout in the guest
system counter view. The VMM thread could be preempted for any number of
reasons (scheduler, L0 hypervisor under nested) between the time that
it calculates the desired guest counter value and when KVM actually sets
this counter state.

Despite the value-based interface that we present to userspace, KVM
actually has idempotent guest controls by way of system counter offsets.
We can avoid all of the issues associated with a value-based interface
by abstracting these offset controls in new ioctls. This series
introduces KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls, meant to provide
userspace with idempotent controls of the guest system counter.

Patch 1 defines the ioctls, and was separated from the two provided
implementations for the sake of review. If it is more intuitive, this
patch can be squashed into the implementation commit.

Patch 2 realizes initial support for ARM64, migrating only the state
associated with the guest's virtual counter-timer. Patch 3 introduces a
KVM selftest to assert that userspace manipulation via the
aforementioned ioctls produces the expected system counter values within
the guest.

Patch 4 extends upon the ARM64 implementation by adding support for
physical counter-timer offsetting. This is currently backed by a
trap-and-emulate implementation, but can also be virtualized in hardware
that fully implements ARMv8.6-ECV. ECV support has been elided from this
series out of convenience for the author :) Patch 5 adds some test cases
to the newly-minted kvm selftest to validate expectations of physical
counter-timer emulation.

Patch 6 introduces yet another KVM selftest for aarch64, intended to
measure the effects of physical counter-timer emulation. Data for this
test can be found below, but basically there is some tradeoff of
overhead for the sake of correctness, but it isn't too bad.

Patches 7-8 add support for the ioctls to x86 by shoehorning the
controls into the pre-existing synchronization heuristics. Patch 7
provides necessary helper methods for the implementation to play nice
with those heuristics, and patch 8 actually implements the ioctls.

Patch 9 adds x86 test cases to the system counter KVM selftest. Lastly,
patch 10 documents the ioctls for both x86 and arm64.

All patches apply cleanly to kvm/next at the following commit:

a4345a7cecfb ("Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD")

Physical counter benchmark
--------------------------

The following data was collected by running 10000 iterations of the
benchmark test from Patch 6 on an Ampere Mt. Jade reference server, A 2S
machine with 2 80-core Ampere Altra SoCs. Measurements were collected
for both VHE and nVHE operation using the `kvm-arm.mode=` command-line
parameter.

nVHE
----

+--------------------+--------+---------+
|       Metric       | Native | Trapped |
+--------------------+--------+---------+
| Average            | 54ns   | 148ns   |
| Standard Deviation | 124ns  | 122ns   |
| 95th Percentile    | 258ns  | 348ns   |
+--------------------+--------+---------+

VHE
---

+--------------------+--------+---------+
|       Metric       | Native | Trapped |
+--------------------+--------+---------+
| Average            | 53ns   | 152ns   |
| Standard Deviation | 92ns   | 94ns    |
| 95th Percentile    | 204ns  | 307ns   |
+--------------------+--------+---------+

Oliver Upton (10):
  KVM: Introduce KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls
  KVM: arm64: Implement initial support for KVM_CAP_SYSTEM_COUNTER_STATE
  selftests: KVM: Introduce system_counter_state_test
  KVM: arm64: Add userspace control of the guest's physical counter
  selftests: KVM: Add test cases for physical counter offsetting
  selftests: KVM: Add counter emulation benchmark
  KVM: x86: Refactor tsc synchronization code
  KVM: x86: Implement KVM_CAP_SYSTEM_COUNTER_STATE
  selftests: KVM: Add support for x86 to system_counter_state_test
  Documentation: KVM: Document KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls

 Documentation/virt/kvm/api.rst                |  98 +++++++
 Documentation/virt/kvm/locking.rst            |  11 +
 arch/arm64/include/asm/kvm_host.h             |   6 +
 arch/arm64/include/asm/sysreg.h               |   1 +
 arch/arm64/include/uapi/asm/kvm.h             |  17 ++
 arch/arm64/kvm/arch_timer.c                   |  84 +++++-
 arch/arm64/kvm/arm.c                          |  25 ++
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  31 +++
 arch/arm64/kvm/hyp/nvhe/timer-sr.c            |  16 +-
 arch/x86/include/asm/kvm_host.h               |   1 +
 arch/x86/include/uapi/asm/kvm.h               |   8 +
 arch/x86/kvm/x86.c                            | 176 +++++++++---
 include/uapi/linux/kvm.h                      |   5 +
 tools/testing/selftests/kvm/.gitignore        |   2 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../kvm/aarch64/counter_emulation_benchmark.c | 209 ++++++++++++++
 .../selftests/kvm/include/aarch64/processor.h |  24 ++
 .../selftests/kvm/system_counter_state_test.c | 256 ++++++++++++++++++
 18 files changed, 926 insertions(+), 47 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c
 create mode 100644 tools/testing/selftests/kvm/system_counter_state_test.c

Comments

Paolo Bonzini June 9, 2021, 1:05 p.m. UTC | #1
On 08/06/21 23:47, Oliver Upton wrote:
> KVM's current means of saving/restoring system counters is plagued with
> temporal issues. At least on ARM64 and x86, we migrate the guest's
> system counter by-value through the respective guest system register
> values (cntvct_el0, ia32_tsc). Restoring system counters by-value is
> brittle as the state is not idempotent: the host system counter is still
> oscillating between the attempted save and restore. Furthermore, VMMs
> may wish to transparently live migrate guest VMs, meaning that they
> include the elapsed time due to live migration blackout in the guest
> system counter view. The VMM thread could be preempted for any number of
> reasons (scheduler, L0 hypervisor under nested) between the time that
> it calculates the desired guest counter value and when KVM actually sets
> this counter state.
> 
> Despite the value-based interface that we present to userspace, KVM
> actually has idempotent guest controls by way of system counter offsets.
> We can avoid all of the issues associated with a value-based interface
> by abstracting these offset controls in new ioctls. This series
> introduces KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls, meant to provide
> userspace with idempotent controls of the guest system counter.

Hi Oliver,

I wonder how this compares to the idea of initializing the TSC via a 
synchronized (nanoseconds, TSC) pair. 
(https://lore.kernel.org/r/20201130133559.233242-2-mlevitsk@redhat.com), 
and whether it makes sense to apply that idea to ARM as well.  If so, it 
certainly is a good idea to use the same capability and ioctl, even 
though the details of the struct would be architecture-dependent.

In your patches there isn't much architecture dependency in struct 
kvm_system_counter_state.  However,  Maxim's also added an 
MSR_IA32_TSC_ADJUST value to the struct, thus ensuring that the host 
could write not just an arbitrary TSC value, but also tie it to an 
arbitrary MSR_IA32_TSC_ADJUST value.  Specifying both in the same ioctl 
simplifies the userspace API.

Paolo

> Patch 1 defines the ioctls, and was separated from the two provided
> implementations for the sake of review. If it is more intuitive, this
> patch can be squashed into the implementation commit.
> 
> Patch 2 realizes initial support for ARM64, migrating only the state
> associated with the guest's virtual counter-timer. Patch 3 introduces a
> KVM selftest to assert that userspace manipulation via the
> aforementioned ioctls produces the expected system counter values within
> the guest.
> 
> Patch 4 extends upon the ARM64 implementation by adding support for
> physical counter-timer offsetting. This is currently backed by a
> trap-and-emulate implementation, but can also be virtualized in hardware
> that fully implements ARMv8.6-ECV. ECV support has been elided from this
> series out of convenience for the author :) Patch 5 adds some test cases
> to the newly-minted kvm selftest to validate expectations of physical
> counter-timer emulation.
> 
> Patch 6 introduces yet another KVM selftest for aarch64, intended to
> measure the effects of physical counter-timer emulation. Data for this
> test can be found below, but basically there is some tradeoff of
> overhead for the sake of correctness, but it isn't too bad.
> 
> Patches 7-8 add support for the ioctls to x86 by shoehorning the
> controls into the pre-existing synchronization heuristics. Patch 7
> provides necessary helper methods for the implementation to play nice
> with those heuristics, and patch 8 actually implements the ioctls.
> 
> Patch 9 adds x86 test cases to the system counter KVM selftest. Lastly,
> patch 10 documents the ioctls for both x86 and arm64.
> 
> All patches apply cleanly to kvm/next at the following commit:
> 
> a4345a7cecfb ("Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD")
> 
> Physical counter benchmark
> --------------------------
> 
> The following data was collected by running 10000 iterations of the
> benchmark test from Patch 6 on an Ampere Mt. Jade reference server, A 2S
> machine with 2 80-core Ampere Altra SoCs. Measurements were collected
> for both VHE and nVHE operation using the `kvm-arm.mode=` command-line
> parameter.
> 
> nVHE
> ----
> 
> +--------------------+--------+---------+
> |       Metric       | Native | Trapped |
> +--------------------+--------+---------+
> | Average            | 54ns   | 148ns   |
> | Standard Deviation | 124ns  | 122ns   |
> | 95th Percentile    | 258ns  | 348ns   |
> +--------------------+--------+---------+
> 
> VHE
> ---
> 
> +--------------------+--------+---------+
> |       Metric       | Native | Trapped |
> +--------------------+--------+---------+
> | Average            | 53ns   | 152ns   |
> | Standard Deviation | 92ns   | 94ns    |
> | 95th Percentile    | 204ns  | 307ns   |
> +--------------------+--------+---------+
> 
> Oliver Upton (10):
>    KVM: Introduce KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls
>    KVM: arm64: Implement initial support for KVM_CAP_SYSTEM_COUNTER_STATE
>    selftests: KVM: Introduce system_counter_state_test
>    KVM: arm64: Add userspace control of the guest's physical counter
>    selftests: KVM: Add test cases for physical counter offsetting
>    selftests: KVM: Add counter emulation benchmark
>    KVM: x86: Refactor tsc synchronization code
>    KVM: x86: Implement KVM_CAP_SYSTEM_COUNTER_STATE
>    selftests: KVM: Add support for x86 to system_counter_state_test
>    Documentation: KVM: Document KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls
> 
>   Documentation/virt/kvm/api.rst                |  98 +++++++
>   Documentation/virt/kvm/locking.rst            |  11 +
>   arch/arm64/include/asm/kvm_host.h             |   6 +
>   arch/arm64/include/asm/sysreg.h               |   1 +
>   arch/arm64/include/uapi/asm/kvm.h             |  17 ++
>   arch/arm64/kvm/arch_timer.c                   |  84 +++++-
>   arch/arm64/kvm/arm.c                          |  25 ++
>   arch/arm64/kvm/hyp/include/hyp/switch.h       |  31 +++
>   arch/arm64/kvm/hyp/nvhe/timer-sr.c            |  16 +-
>   arch/x86/include/asm/kvm_host.h               |   1 +
>   arch/x86/include/uapi/asm/kvm.h               |   8 +
>   arch/x86/kvm/x86.c                            | 176 +++++++++---
>   include/uapi/linux/kvm.h                      |   5 +
>   tools/testing/selftests/kvm/.gitignore        |   2 +
>   tools/testing/selftests/kvm/Makefile          |   3 +
>   .../kvm/aarch64/counter_emulation_benchmark.c | 209 ++++++++++++++
>   .../selftests/kvm/include/aarch64/processor.h |  24 ++
>   .../selftests/kvm/system_counter_state_test.c | 256 ++++++++++++++++++
>   18 files changed, 926 insertions(+), 47 deletions(-)
>   create mode 100644 tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c
>   create mode 100644 tools/testing/selftests/kvm/system_counter_state_test.c
>
Oliver Upton June 9, 2021, 3:11 p.m. UTC | #2
On Wed, Jun 9, 2021 at 8:06 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 08/06/21 23:47, Oliver Upton wrote:
> > KVM's current means of saving/restoring system counters is plagued with
> > temporal issues. At least on ARM64 and x86, we migrate the guest's
> > system counter by-value through the respective guest system register
> > values (cntvct_el0, ia32_tsc). Restoring system counters by-value is
> > brittle as the state is not idempotent: the host system counter is still
> > oscillating between the attempted save and restore. Furthermore, VMMs
> > may wish to transparently live migrate guest VMs, meaning that they
> > include the elapsed time due to live migration blackout in the guest
> > system counter view. The VMM thread could be preempted for any number of
> > reasons (scheduler, L0 hypervisor under nested) between the time that
> > it calculates the desired guest counter value and when KVM actually sets
> > this counter state.
> >
> > Despite the value-based interface that we present to userspace, KVM
> > actually has idempotent guest controls by way of system counter offsets.
> > We can avoid all of the issues associated with a value-based interface
> > by abstracting these offset controls in new ioctls. This series
> > introduces KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls, meant to provide
> > userspace with idempotent controls of the guest system counter.
>
> Hi Oliver,
>
> I wonder how this compares to the idea of initializing the TSC via a
> synchronized (nanoseconds, TSC) pair.
> (https://lore.kernel.org/r/20201130133559.233242-2-mlevitsk@redhat.com),
> and whether it makes sense to apply that idea to ARM as well.  If so, it
> certainly is a good idea to use the same capability and ioctl, even
> though the details of the struct would be architecture-dependent.

Hey Paolo,

Yeah, great question, I had actually alluded to this on [02/10] in
talking to Marc about this.

Really the issue we want to avoid is sampling the host counter twice,
which at least based on the existing means of counter migration is
impossible as the VMM must account for elapsed time. Maxim's patches
appear to address that very issue as well.

Perhaps this will clarify the motivation for my approach: what if the
kernel wasn't the authoritative source for wall time in a system?
Furthermore, VMMs may wish to define their own heuristics for counter
migration (e.g. we only allow the counter to 'jump' by X seconds
during migration blackout). If a VMM tried to assert its whims on the
TSC state before handing it down to the kernel, we would inadvertently
be sampling the host counter twice again. And, anything can happen
between the time we assert elapsed time is within SLO and KVM
computing the TSC offset (scheduling, L0 hypervisor preemption).

So, Maxim's changes would address my concerns in the general case, but
maybe not as much in edge cases where an operator may make decisions
about how much time can elapse while the guest hasn't had CPU time.

--
Thanks,
Oliver

> In your patches there isn't much architecture dependency in struct
> kvm_system_counter_state.  However,  Maxim's also added an
> MSR_IA32_TSC_ADJUST value to the struct, thus ensuring that the host
> could write not just an arbitrary TSC value, but also tie it to an
> arbitrary MSR_IA32_TSC_ADJUST value.  Specifying both in the same ioctl
> simplifies the userspace API.
>
> Paolo
>
> > Patch 1 defines the ioctls, and was separated from the two provided
> > implementations for the sake of review. If it is more intuitive, this
> > patch can be squashed into the implementation commit.
> >
> > Patch 2 realizes initial support for ARM64, migrating only the state
> > associated with the guest's virtual counter-timer. Patch 3 introduces a
> > KVM selftest to assert that userspace manipulation via the
> > aforementioned ioctls produces the expected system counter values within
> > the guest.
> >
> > Patch 4 extends upon the ARM64 implementation by adding support for
> > physical counter-timer offsetting. This is currently backed by a
> > trap-and-emulate implementation, but can also be virtualized in hardware
> > that fully implements ARMv8.6-ECV. ECV support has been elided from this
> > series out of convenience for the author :) Patch 5 adds some test cases
> > to the newly-minted kvm selftest to validate expectations of physical
> > counter-timer emulation.
> >
> > Patch 6 introduces yet another KVM selftest for aarch64, intended to
> > measure the effects of physical counter-timer emulation. Data for this
> > test can be found below, but basically there is some tradeoff of
> > overhead for the sake of correctness, but it isn't too bad.
> >
> > Patches 7-8 add support for the ioctls to x86 by shoehorning the
> > controls into the pre-existing synchronization heuristics. Patch 7
> > provides necessary helper methods for the implementation to play nice
> > with those heuristics, and patch 8 actually implements the ioctls.
> >
> > Patch 9 adds x86 test cases to the system counter KVM selftest. Lastly,
> > patch 10 documents the ioctls for both x86 and arm64.
> >
> > All patches apply cleanly to kvm/next at the following commit:
> >
> > a4345a7cecfb ("Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD")
> >
> > Physical counter benchmark
> > --------------------------
> >
> > The following data was collected by running 10000 iterations of the
> > benchmark test from Patch 6 on an Ampere Mt. Jade reference server, A 2S
> > machine with 2 80-core Ampere Altra SoCs. Measurements were collected
> > for both VHE and nVHE operation using the `kvm-arm.mode=` command-line
> > parameter.
> >
> > nVHE
> > ----
> >
> > +--------------------+--------+---------+
> > |       Metric       | Native | Trapped |
> > +--------------------+--------+---------+
> > | Average            | 54ns   | 148ns   |
> > | Standard Deviation | 124ns  | 122ns   |
> > | 95th Percentile    | 258ns  | 348ns   |
> > +--------------------+--------+---------+
> >
> > VHE
> > ---
> >
> > +--------------------+--------+---------+
> > |       Metric       | Native | Trapped |
> > +--------------------+--------+---------+
> > | Average            | 53ns   | 152ns   |
> > | Standard Deviation | 92ns   | 94ns    |
> > | 95th Percentile    | 204ns  | 307ns   |
> > +--------------------+--------+---------+
> >
> > Oliver Upton (10):
> >    KVM: Introduce KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls
> >    KVM: arm64: Implement initial support for KVM_CAP_SYSTEM_COUNTER_STATE
> >    selftests: KVM: Introduce system_counter_state_test
> >    KVM: arm64: Add userspace control of the guest's physical counter
> >    selftests: KVM: Add test cases for physical counter offsetting
> >    selftests: KVM: Add counter emulation benchmark
> >    KVM: x86: Refactor tsc synchronization code
> >    KVM: x86: Implement KVM_CAP_SYSTEM_COUNTER_STATE
> >    selftests: KVM: Add support for x86 to system_counter_state_test
> >    Documentation: KVM: Document KVM_{GET,SET}_SYSTEM_COUNTER_STATE ioctls
> >
> >   Documentation/virt/kvm/api.rst                |  98 +++++++
> >   Documentation/virt/kvm/locking.rst            |  11 +
> >   arch/arm64/include/asm/kvm_host.h             |   6 +
> >   arch/arm64/include/asm/sysreg.h               |   1 +
> >   arch/arm64/include/uapi/asm/kvm.h             |  17 ++
> >   arch/arm64/kvm/arch_timer.c                   |  84 +++++-
> >   arch/arm64/kvm/arm.c                          |  25 ++
> >   arch/arm64/kvm/hyp/include/hyp/switch.h       |  31 +++
> >   arch/arm64/kvm/hyp/nvhe/timer-sr.c            |  16 +-
> >   arch/x86/include/asm/kvm_host.h               |   1 +
> >   arch/x86/include/uapi/asm/kvm.h               |   8 +
> >   arch/x86/kvm/x86.c                            | 176 +++++++++---
> >   include/uapi/linux/kvm.h                      |   5 +
> >   tools/testing/selftests/kvm/.gitignore        |   2 +
> >   tools/testing/selftests/kvm/Makefile          |   3 +
> >   .../kvm/aarch64/counter_emulation_benchmark.c | 209 ++++++++++++++
> >   .../selftests/kvm/include/aarch64/processor.h |  24 ++
> >   .../selftests/kvm/system_counter_state_test.c | 256 ++++++++++++++++++
> >   18 files changed, 926 insertions(+), 47 deletions(-)
> >   create mode 100644 tools/testing/selftests/kvm/aarch64/counter_emulation_benchmark.c
> >   create mode 100644 tools/testing/selftests/kvm/system_counter_state_test.c
> >
>
Paolo Bonzini June 9, 2021, 5:05 p.m. UTC | #3
On 09/06/21 17:11, Oliver Upton wrote:
> Perhaps this will clarify the motivation for my approach: what if the
> kernel wasn't the authoritative source for wall time in a system?
> Furthermore, VMMs may wish to define their own heuristics for counter
> migration (e.g. we only allow the counter to 'jump' by X seconds
> during migration blackout). If a VMM tried to assert its whims on the
> TSC state before handing it down to the kernel, we would inadvertently
> be sampling the host counter twice again. And, anything can happen
> between the time we assert elapsed time is within SLO and KVM
> computing the TSC offset (scheduling, L0 hypervisor preemption).
> 
> So, Maxim's changes would address my concerns in the general case, but
> maybe not as much in edge cases where an operator may make decisions
> about how much time can elapse while the guest hasn't had CPU time.

I think I understand.  We still need a way to get a consistent 
(host_TSC, nanosecond) pair on the source, the TSC offset is not enough. 
  This is arguably not a KVM issue, but we're still the one having to 
provide a solution, so we would need a slightly more complicated interface.

1) In the kernel:

* KVM_GET_CLOCK should also return kvmclock_ns - realtime_ns and 
host_TSC.  It should set two flags in struct kvm_clock_data saying that 
the respective fields are valid.

* KVM_SET_CLOCK checks the flag for kvmclock_ns - realtime_ns.  If set, 
it looks at the kvmclock_ns - realtime_ns field and disregards the 
kvmclock_ns field.

2) On the source, userspace will:

* per-VM: invoke KVM_GET_CLOCK.  Migrate kvmclock_ns - realtime_ns and 
kvmclock_ns.  Stash host_TSC for subsequent use.

* per-vCPU: retrieve guest_TSC - host_TSC with your new ioctl.  Sum it 
to the stashed host_TSC value; migrate the resulting value (a guest TSC).

3) On the destination:

* per-VM: Pass the migrated kvmclock_ns - realtime_ns to KVM_SET_CLOCK. 
  Use KVM_GET_CLOCK to get a consistent pair of kvmclock_ns ("newNS" 
below) and host TSC ("newHostTSC").  Stash them for subsequent use, 
together with the migrated kvmclock_ns value ("sourceNS") that you 
haven't used yet.

* per-vCPU: using the data of the previous step, and the sourceGuestTSC 
in the migration stream, compute sourceGuestTSC + (newNS - sourceNS) * 
freq - newHostTSC.  This is the TSC offset to be passed to your new ioctl.

Your approach still needs to use the "quirky" approach to host-initiated 
MSR_IA32_TSC_ADJUST writes, which write the MSR without affecting the 
VMCS offset.  This is just a documentation issue.

Does this make sense?

Paolo
Oliver Upton June 9, 2021, 10:04 p.m. UTC | #4
On Wed, Jun 9, 2021 at 12:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 09/06/21 17:11, Oliver Upton wrote:
> > Perhaps this will clarify the motivation for my approach: what if the
> > kernel wasn't the authoritative source for wall time in a system?
> > Furthermore, VMMs may wish to define their own heuristics for counter
> > migration (e.g. we only allow the counter to 'jump' by X seconds
> > during migration blackout). If a VMM tried to assert its whims on the
> > TSC state before handing it down to the kernel, we would inadvertently
> > be sampling the host counter twice again. And, anything can happen
> > between the time we assert elapsed time is within SLO and KVM
> > computing the TSC offset (scheduling, L0 hypervisor preemption).
> >
> > So, Maxim's changes would address my concerns in the general case, but
> > maybe not as much in edge cases where an operator may make decisions
> > about how much time can elapse while the guest hasn't had CPU time.
>
> I think I understand.  We still need a way to get a consistent
> (host_TSC, nanosecond) pair on the source, the TSC offset is not enough.
>   This is arguably not a KVM issue, but we're still the one having to
> provide a solution, so we would need a slightly more complicated interface.

Ah, right, good luck doing that without some help from the kernel. +1
to this _not_ being a KVM issue, but anyways...

> 1) In the kernel:
>
> * KVM_GET_CLOCK should also return kvmclock_ns - realtime_ns and
> host_TSC.  It should set two flags in struct kvm_clock_data saying that
> the respective fields are valid.
>
> * KVM_SET_CLOCK checks the flag for kvmclock_ns - realtime_ns.  If set,
> it looks at the kvmclock_ns - realtime_ns field and disregards the
> kvmclock_ns field.

Yeah, these additions all make sense to me.

>
> 2) On the source, userspace will:
>
> * per-VM: invoke KVM_GET_CLOCK.  Migrate kvmclock_ns - realtime_ns and
> kvmclock_ns.  Stash host_TSC for subsequent use.
>
> * per-vCPU: retrieve guest_TSC - host_TSC with your new ioctl.  Sum it
> to the stashed host_TSC value; migrate the resulting value (a guest TSC).
>
> 3) On the destination:
>
> * per-VM: Pass the migrated kvmclock_ns - realtime_ns to KVM_SET_CLOCK.
>   Use KVM_GET_CLOCK to get a consistent pair of kvmclock_ns ("newNS"
> below) and host TSC ("newHostTSC").  Stash them for subsequent use,
> together with the migrated kvmclock_ns value ("sourceNS") that you
> haven't used yet.
>
> * per-vCPU: using the data of the previous step, and the sourceGuestTSC
> in the migration stream, compute sourceGuestTSC + (newNS - sourceNS) *
> freq - newHostTSC.  This is the TSC offset to be passed to your new ioctl.
>
> Your approach still needs to use the "quirky" approach to host-initiated
> MSR_IA32_TSC_ADJUST writes, which write the MSR without affecting the
> VMCS offset.  This is just a documentation issue.

I think I follow what you're saying. To confirm:

My suggested ioctl for the vCPU will still exist, and it will still
affect the VMCS tsc offset, right? However, we need to do one of the
following:

- Stash the guest's MSR_IA32_TSC_ADJUST value in the
kvm_system_counter_state structure. During
KVM_SET_SYSTEM_COUNTER_STATE, check to see if the field is valid. If
so, treat it as a dumb value (i.e. show it to the guest but don't fold
it into the offset).
- Inform userspace that it must still migrate MSR_IA32_TSC_ADJUST, and
continue to our quirky behavior around host-initiated writes of the
MSR.

This is why Maxim's spin migrated a value for IA32_TSC_ADJUST, right?
Doing so ensures we don't have any guest-observable consequences due
to our migration of TSC state. To me, adding the guest IA32_TSC_ADJUST
MSR into the new counter state structure is probably best. No strong
opinions in either direction on this point, though :)

--
Thanks,
Oliver

>
> Does this make sense?
>
> Paolo
>
Paolo Bonzini June 10, 2021, 6:22 a.m. UTC | #5
On 10/06/21 00:04, Oliver Upton wrote:
>> Your approach still needs to use the "quirky" approach to host-initiated
>> MSR_IA32_TSC_ADJUST writes, which write the MSR without affecting the
>> VMCS offset.  This is just a documentation issue.
> 
> My suggested ioctl for the vCPU will still exist, and it will still
> affect the VMCS tsc offset, right? However, we need to do one of the
> following:
> 
> - Stash the guest's MSR_IA32_TSC_ADJUST value in the
> kvm_system_counter_state structure. During
> KVM_SET_SYSTEM_COUNTER_STATE, check to see if the field is valid. If
> so, treat it as a dumb value (i.e. show it to the guest but don't fold
> it into the offset).

Yes, it's already folded into the guestTSC-hostTSC offset that the 
caller provides.

> - Inform userspace that it must still migrate MSR_IA32_TSC_ADJUST, and
> continue to our quirky behavior around host-initiated writes of the
> MSR.
> 
> This is why Maxim's spin migrated a value for IA32_TSC_ADJUST, right?

Yes, so that he could then remove (optionally) the quirk for 
host-initiated writes to the TSC and TSC_ADJUST MSRs.

> Doing so ensures we don't have any guest-observable consequences due
> to our migration of TSC state. To me, adding the guest IA32_TSC_ADJUST
> MSR into the new counter state structure is probably best. No strong
> opinions in either direction on this point, though:)

Either is good for me, since documentation will be very important either 
way.  This is a complex API to use due to the possibility of skewed TSCs.

Just one thing, please don't introduce a new ioctl and use 
KVM_GET_DEVICE_ATTR/KVM_SET_DEVICE_ATTR/KVM_HAS_DEVICE_ATTR.

Christian, based on what Oliver mentions here, it's probably useful for 
s390 to have functionality to get/set kvm->arch.epoch and kvm->arch.epdx 
in addition to the absolute TOD values that you are migrating now.

Paolo

>> 1) In the kernel:
>>
>> * KVM_GET_CLOCK should also return kvmclock_ns - realtime_ns and
>> host_TSC.  It should set two flags in struct kvm_clock_data saying that
>> the respective fields are valid.
>>
>> * KVM_SET_CLOCK checks the flag for kvmclock_ns - realtime_ns.  If set,
>> it looks at the kvmclock_ns - realtime_ns field and disregards the
>> kvmclock_ns field.
>>
>> 2) On the source, userspace will:
>>
>> * per-VM: invoke KVM_GET_CLOCK.  Migrate kvmclock_ns - realtime_ns and
>> kvmclock_ns.  Stash host_TSC for subsequent use.
>>
>> * per-vCPU: retrieve guest_TSC - host_TSC with your new ioctl.  Sum it
>> to the stashed host_TSC value; migrate the resulting value (a guest TSC).
>>
>> 3) On the destination:
>>
>> * per-VM: Pass the migrated kvmclock_ns - realtime_ns to KVM_SET_CLOCK.
>>   Use KVM_GET_CLOCK to get a consistent pair of kvmclock_ns ("newNS"
>> below) and host TSC ("newHostTSC").  Stash them for subsequent use,
>> together with the migrated kvmclock_ns value ("sourceNS") that you
>> haven't used yet.
>>
>> * per-vCPU: using the data of the previous step, and the sourceGuestTSC
>> in the migration stream, compute sourceGuestTSC + (newNS - sourceNS) *
>> freq - newHostTSC.  This is the TSC offset to be passed to your new ioctl.
Christian Borntraeger June 10, 2021, 6:53 a.m. UTC | #6
On 10.06.21 08:22, Paolo Bonzini wrote:
> On 10/06/21 00:04, Oliver Upton wrote:
>>> Your approach still needs to use the "quirky" approach to host-initiated
>>> MSR_IA32_TSC_ADJUST writes, which write the MSR without affecting the
>>> VMCS offset.  This is just a documentation issue.
>>
>> My suggested ioctl for the vCPU will still exist, and it will still
>> affect the VMCS tsc offset, right? However, we need to do one of the
>> following:
>>
>> - Stash the guest's MSR_IA32_TSC_ADJUST value in the
>> kvm_system_counter_state structure. During
>> KVM_SET_SYSTEM_COUNTER_STATE, check to see if the field is valid. If
>> so, treat it as a dumb value (i.e. show it to the guest but don't fold
>> it into the offset).
> 
> Yes, it's already folded into the guestTSC-hostTSC offset that the caller provides.
> 
>> - Inform userspace that it must still migrate MSR_IA32_TSC_ADJUST, and
>> continue to our quirky behavior around host-initiated writes of the
>> MSR.
>>
>> This is why Maxim's spin migrated a value for IA32_TSC_ADJUST, right?
> 
> Yes, so that he could then remove (optionally) the quirk for host-initiated writes to the TSC and TSC_ADJUST MSRs.
> 
>> Doing so ensures we don't have any guest-observable consequences due
>> to our migration of TSC state. To me, adding the guest IA32_TSC_ADJUST
>> MSR into the new counter state structure is probably best. No strong
>> opinions in either direction on this point, though:)
> 
> Either is good for me, since documentation will be very important either way.  This is a complex API to use due to the possibility of skewed TSCs.
> 
> Just one thing, please don't introduce a new ioctl and use KVM_GET_DEVICE_ATTR/KVM_SET_DEVICE_ATTR/KVM_HAS_DEVICE_ATTR.
> 
> Christian, based on what Oliver mentions here, it's probably useful for s390 to have functionality to get/set kvm->arch.epoch and kvm->arch.epdx in addition to the absolute TOD values that you are migrating now.

Yes, a scheme where we migrate the offsets (assuming that the hosts are synced) would be often better.
If the hosts are not synced, things will be harder. I will have a look at this series, Thanks for the pointer.