mbox series

[RFC,v3,0/5] x86/kvm: Virtual suspend time injection support

Message ID 20211020120431.776494-1-hikalium@chromium.org (mailing list archive)
Headers show
Series x86/kvm: Virtual suspend time injection support | expand

Message

Hikaru Nishida Oct. 20, 2021, 12:04 p.m. UTC
Hi,

This patch series adds virtual suspend time injection support to KVM.
It is an updated version of the following series:
v2:
https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/
v1:
https://lore.kernel.org/kvm/20210426090644.2218834-1-hikalium@chromium.org/

Please take a look again.

To kvm/arm64 folks:
I'm going to implement this mechanism to ARM64 as well but not
sure which function should be used to make an IRQ (like kvm_apic_set_irq
in x86) and if it is okay to use kvm_gfn_to_hva_cache /
kvm_write_guest_cached for sharing the suspend duration.
Please let me know if there is other suitable way or any suggestions.

Thanks,

Hikaru Nishida


Changes in v3:
- Used PM notifier instead of modifying timekeeping_resume()
  - This avoids holding kvm_lock under interrupt disabled context.
- Used KVM_REQ_* to make a request for vcpus.
- Reused HYPERVISOR_CALLBACK_VECTOR IRQ instead of adding a new one.
- Extracted arch-independent parts.
- Fixed other reviewed points.

Hikaru Nishida (5):
  timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns
  kvm/x86: Include asm/pvclock.h in asm/kvmclock.h
  kvm/x86: virtual suspend time injection: Add common definitions
  kvm/x86: virtual suspend time injection: Implement host side
  kvm/x86: virtual suspend time injection: Implement guest side

 Documentation/virt/kvm/cpuid.rst     |   3 +
 Documentation/virt/kvm/msr.rst       |  30 ++++++++
 arch/x86/Kconfig                     |  13 ++++
 arch/x86/include/asm/idtentry.h      |   2 +-
 arch/x86/include/asm/kvm_host.h      |   2 +
 arch/x86/include/asm/kvmclock.h      |  11 +++
 arch/x86/include/uapi/asm/kvm_para.h |   6 ++
 arch/x86/kernel/kvm.c                |  14 +++-
 arch/x86/kernel/kvmclock.c           |  26 +++++++
 arch/x86/kvm/Kconfig                 |  13 ++++
 arch/x86/kvm/cpuid.c                 |   4 +
 arch/x86/kvm/x86.c                   | 109 +++++++++++++++++++++++++++
 arch/x86/mm/fault.c                  |   2 +-
 include/linux/kvm_host.h             |  48 ++++++++++++
 include/linux/timekeeper_internal.h  |   5 ++
 include/linux/timekeeping.h          |   6 ++
 kernel/time/timekeeping.c            |  56 ++++++++++++++
 virt/kvm/kvm_main.c                  |  88 +++++++++++++++++++++
 18 files changed, 432 insertions(+), 6 deletions(-)

Comments

Marc Zyngier Oct. 20, 2021, 1:52 p.m. UTC | #1
Hi Hikaru,

On Wed, 20 Oct 2021 13:04:25 +0100,
Hikaru Nishida <hikalium@chromium.org> wrote:
> 
> 
> Hi,
> 
> This patch series adds virtual suspend time injection support to KVM.
> It is an updated version of the following series:
> v2:
> https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/
> v1:
> https://lore.kernel.org/kvm/20210426090644.2218834-1-hikalium@chromium.org/
> 
> Please take a look again.
> 
> To kvm/arm64 folks:
> I'm going to implement this mechanism to ARM64 as well but not
> sure which function should be used to make an IRQ (like kvm_apic_set_irq
> in x86) and if it is okay to use kvm_gfn_to_hva_cache /
> kvm_write_guest_cached for sharing the suspend duration.

Before we discuss interrupt injection, I want to understand what this
is doing, and how this is doing it. And more precisely, I want to find
out how you solve the various problems described by Thomas here [1].

Assuming you solve these, you should model the guest memory access
similarly to what we do for stolen time. As for injecting an
interrupt, why can't this be a userspace thing?

Thanks,

	M.

[1] https://lore.kernel.org/all/871r557jls.ffs@tglx
Hikaru Nishida Nov. 4, 2021, 9:10 a.m. UTC | #2
Hi Marc,

Thanks for the comments! (Sorry for the late reply)

On Wed, Oct 20, 2021 at 10:52 PM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Hikaru,
>
> On Wed, 20 Oct 2021 13:04:25 +0100,
> Hikaru Nishida <hikalium@chromium.org> wrote:
> >
> >
> > Hi,
> >
> > This patch series adds virtual suspend time injection support to KVM.
> > It is an updated version of the following series:
> > v2:
> > https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/
> > v1:
> > https://lore.kernel.org/kvm/20210426090644.2218834-1-hikalium@chromium.org/
> >
> > Please take a look again.
> >
> > To kvm/arm64 folks:
> > I'm going to implement this mechanism to ARM64 as well but not
> > sure which function should be used to make an IRQ (like kvm_apic_set_irq
> > in x86) and if it is okay to use kvm_gfn_to_hva_cache /
> > kvm_write_guest_cached for sharing the suspend duration.
>
> Before we discuss interrupt injection, I want to understand what this
> is doing, and how this is doing it. And more precisely, I want to find
> out how you solve the various problems described by Thomas here [1].

The problems described by Thomas in the thread was:
- User space or kernel space can observe the stale timestamp before
the adjustment
  - Moving CLOCK_MONOTONIC forward will trigger all sorts of timeouts,
watchdogs, etc...
- The last attempt to make CLOCK_MONOTONIC behave like CLOCK_BOOTTIME
was reverted within 3 weeks. a3ed0e4393d6 ("Revert: Unify
CLOCK_MONOTONIC and CLOCK_BOOTTIME")
  - CLOCK_MONOTONIC correctness (stops during the suspend) should be maintained.

I agree with the points above. And, the current CLOCK_MONOTONIC
behavior in the KVM guest is not aligned with the statements above.
(it advances during the host's suspension.)
This causes the problems described above (triggering watchdog
timeouts, etc...) so my patches are going to fix this by 2 steps
roughly:
1. Stopping the guest's clocks during the host's suspension
2. Adjusting CLOCK_BOOTTIME later
This will make the clocks behave like the host does, not making
CLOCK_MONOTONIC behave like CLOCK_BOOTTIME.

First one is a bit tricky since the guest can use a timestamp counter
in each CPUs (TSC in x86) and we need to adjust it without stale
values are observed by the guest kernel to prevent rewinding of
CLOCK_MONOTONIC (which is our top priority to make the kernel happy).
To achieve this, my patch adjusts TSCs (and a kvm-clock) before the
first vcpu runs of each vcpus after the resume.

Second one is relatively safe: since jumping CLOCK_BOOTTIME forward
can happen even before my patches when suspend/resume happens, and
that will not break the monotonicity of the clocks, we can do that
through IRQ.

[1] shows the flow of the adjustment logic, and [2] shows how the
clocks behave in the guest and the host before/after my patches.
The numbers on each step in [1] corresponds to the timing shown in [2].
The left side of [2] is showing the behavior of the clocks before the
patches, and the right side shows after the patches. Also, upper
charts show the guest clocks, and bottom charts are host clocks.

Before the patches(left side), CLOCK_MONOTONIC seems to be jumped from
the guest's perspective after the host's suspension. As Thomas says,
large jumps of CLOCK_MONOTONIC may lead to watchdog timeouts and other
bad things that we want to avoid.
With the patches(right side), both clocks will be adjusted (t=4,5) as
if they are stopped during the suspension. This adjustment is done by
the host side and invisible to the guest since it is done before the
first vcpu run after the resume. After that, CLOCK_BOOTTIME will be
adjusted from the guest side, triggered by the IRQ sent from the host.

[1]: https://hikalium.com/files/kvm_virt_suspend_time_seq.png
[2]: https://hikalium.com/files/kvm_virt_suspend_time_clocks.png


>
> Assuming you solve these, you should model the guest memory access
> similarly to what we do for stolen time. As for injecting an
> interrupt, why can't this be a userspace thing?

Since CLOCK_BOOTTIME is calculated by adding a gap
(tk->monotonic_to_boot) to CLOCK_MONOTONIC, and there are no way to
change the value from the outside of the guest kernel, we should
implement some mechanism in the kernel to adjust it.
(Actually, I tried to add a sysfs interface to modify the gap [3], but
I learned that that is not a good idea...)

[3]: https://lore.kernel.org/lkml/87eehoax14.fsf@nanos.tec.linutronix.de/

Thank you,

Hikaru Nishida

>
> Thanks,
>
>         M.
>
> [1] https://lore.kernel.org/all/871r557jls.ffs@tglx
>
>
> --
> Without deviation from the norm, progress is not possible.
Marc Zyngier Dec. 4, 2021, 5:30 p.m. UTC | #3
Hi Hikaru,

Apologies for the much delayed reply.

> The problems described by Thomas in the thread was:
> - User space or kernel space can observe the stale timestamp before
> the adjustment
>   - Moving CLOCK_MONOTONIC forward will trigger all sorts of timeouts,
> watchdogs, etc...
> - The last attempt to make CLOCK_MONOTONIC behave like CLOCK_BOOTTIME
> was reverted within 3 weeks. a3ed0e4393d6 ("Revert: Unify
> CLOCK_MONOTONIC and CLOCK_BOOTTIME")
>   - CLOCK_MONOTONIC correctness (stops during the suspend) should be maintained.
> 
> I agree with the points above. And, the current CLOCK_MONOTONIC
> behavior in the KVM guest is not aligned with the statements above.
> (it advances during the host's suspension.)
> This causes the problems described above (triggering watchdog
> timeouts, etc...) so my patches are going to fix this by 2 steps
> roughly:
> 1. Stopping the guest's clocks during the host's suspension
> 2. Adjusting CLOCK_BOOTTIME later
> This will make the clocks behave like the host does, not making
> CLOCK_MONOTONIC behave like CLOCK_BOOTTIME.
> 
> First one is a bit tricky since the guest can use a timestamp counter
> in each CPUs (TSC in x86) and we need to adjust it without stale
> values are observed by the guest kernel to prevent rewinding of
> CLOCK_MONOTONIC (which is our top priority to make the kernel happy).
> To achieve this, my patch adjusts TSCs (and a kvm-clock) before the
> first vcpu runs of each vcpus after the resume.
> 
> Second one is relatively safe: since jumping CLOCK_BOOTTIME forward
> can happen even before my patches when suspend/resume happens, and
> that will not break the monotonicity of the clocks, we can do that
> through IRQ.
> 
> [1] shows the flow of the adjustment logic, and [2] shows how the
> clocks behave in the guest and the host before/after my patches.
> The numbers on each step in [1] corresponds to the timing shown in [2].
> The left side of [2] is showing the behavior of the clocks before the
> patches, and the right side shows after the patches. Also, upper
> charts show the guest clocks, and bottom charts are host clocks.
> 
> Before the patches(left side), CLOCK_MONOTONIC seems to be jumped from
> the guest's perspective after the host's suspension. As Thomas says,
> large jumps of CLOCK_MONOTONIC may lead to watchdog timeouts and other
> bad things that we want to avoid.
> With the patches(right side), both clocks will be adjusted (t=4,5) as
> if they are stopped during the suspension. This adjustment is done by
> the host side and invisible to the guest since it is done before the
> first vcpu run after the resume. After that, CLOCK_BOOTTIME will be
> adjusted from the guest side, triggered by the IRQ sent from the host.
> 
> [1]: https://hikalium.com/files/kvm_virt_suspend_time_seq.png
> [2]: https://hikalium.com/files/kvm_virt_suspend_time_clocks.png

Thanks for the very detailed explanation. You obviously have though
about this, and it makes sense.

My worry is that this looks to be designed for the needs of Linux on
x86, and does not match the reality of KVM on arm64, where there is no
KVM clock (there is no need for it, and I have no plan to support it),
and there is more than a single counter visible to the guest (at least
two, and up to four with NV, all with various offsets). This also
deals with concepts that are Linux-specific. How would it work for
another (arbitrary) guest operating system?

Can we please take a step back and look at what we want to expose from
a hypervisor PoV? It seems to me that all we want is:

(1) tell the guest that time has moved forward
(2) tell the guest by how much time has moved forward

In a way, this isn't different from stolen time, only that it affects
the whole VM and not just a single CPU (and for a much longer quantum
of time).

How the event is handled by the guest (what it means for its clocks
and all that) is a guest problem. Why should KVM itself adjust the
counters? This goes against what the architecture specifies (the
counter is in an always-on domain that keeps counting when suspended),
and KVM must preserve the architectural guarantees instead of
deviating from them.

> > Assuming you solve these, you should model the guest memory access
> > similarly to what we do for stolen time. As for injecting an
> > interrupt, why can't this be a userspace thing?
> 
> Since CLOCK_BOOTTIME is calculated by adding a gap
> (tk->monotonic_to_boot) to CLOCK_MONOTONIC, and there are no way to
> change the value from the outside of the guest kernel, we should
> implement some mechanism in the kernel to adjust it.
> (Actually, I tried to add a sysfs interface to modify the gap [3], but
> I learned that that is not a good idea...)

It is not what I was suggesting.

My suggestion was to have a shared memory between the VMM and the
guest again, similar to the way we handle stolen time), let the VMM
expose the drift in this shared memory, and inject an interrupt from
userspace to signify a wake-up. All this requires is that on suspend,
you force the vcpus to exit. On resume, the VMM update the guest
visible drift, posts an interrupt, and let things rip.

This requires very minimal KVM support, and squarely places the logic
in the guest. Why doesn't this work?

Another question is maybe even more naive: on bare metal, we don't
need any of this. The system suspends, resumes, and recovers well
enough. Nobody hides anything, and yet everything works just fine.
That's because the kernel knows it is being suspended, and it acts
accordingly. It looks to me that there is some value in following the
same principles, even if this means that the host suspend has to
synchronise with the guest being suspended.

Thanks,

	M.