
[v3,1/3] kvm: Introduce kvm_total_suspend_ns().

Message ID 20250107042202.2554063-2-suleiman@google.com (mailing list archive)
State New, archived
Series KVM: x86: Include host suspended time in steal time.

Commit Message

Suleiman Souhlal Jan. 7, 2025, 4:21 a.m. UTC
It returns the cumulative nanoseconds that the host has been suspended.
It is intended to be used for reporting host suspend time to the guest.

Change-Id: I8f644c9fbdb2c48d2c99dd9efaa5c85a83a14c2a
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

Comments

Sean Christopherson Jan. 7, 2025, 3:27 p.m. UTC | #1
KVM: for the scope.

On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> It returns the cumulative nanoseconds that the host has been suspended.

Please write changelogs that are standalone.  "It returns ..." is completely
nonsensical without the context of the shortlog.

> It is intended to be used for reporting host suspend time to the guest.
> 
> Change-Id: I8f644c9fbdb2c48d2c99dd9efaa5c85a83a14c2a

Drop gerrit's metadata before posting.

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 401439bb21e3e6..cf926168b30820 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2553,4 +2553,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range);
>  #endif
>  
> +u64 kvm_total_suspend_ns(void);
> +
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index de2c11dae23163..d5ae237df76d0d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -889,13 +889,39 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
>  
> +static u64 last_suspend;
> +static u64 total_suspend_ns;
> +
> +u64 kvm_total_suspend_ns(void)
> +{
> +	return total_suspend_ns;
> +}
> +
> +

Extra whitespace.

>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> +{
> +	switch (state) {
> +	case PM_HIBERNATION_PREPARE:
> +	case PM_SUSPEND_PREPARE:
> +		last_suspend = ktime_get_boottime_ns();
> +	case PM_POST_HIBERNATION:
> +	case PM_POST_SUSPEND:
> +		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;

This is broken.  As should be quite clear from the function parameters, this is
a per-VM notifier.  While clobbering "last_suspend" is relatively benign,
accumulating into "total_suspend_ns" for every VM will cause the "total" suspend
time to be wildly inaccurate if there are multiple VMs.

> +	}
> +
> +	return NOTIFY_DONE;
> +}
> +
>  static int kvm_pm_notifier_call(struct notifier_block *bl,
>  				unsigned long state,
>  				void *unused)
>  {
>  	struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
>  
> +	if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
> +		return NOTIFY_BAD;

This is ridiculous on multiple fronts.  There's zero reason for kvm_pm_notifier()
to return a value, it always succeeds.  I also see zero reason to put this code
in common KVM.  It's 100% an x86-only concept, and x86 is the only architecture
that implements kvm_arch_pm_notifier().  So just shove the logic into x86's
implementation.

> +
>  	return kvm_arch_pm_notifier(kvm, state);
>  }
>  
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
>
Suleiman Souhlal Jan. 7, 2025, 4:43 p.m. UTC | #2
On Wed, Jan 8, 2025 at 12:27 AM Sean Christopherson <seanjc@google.com> wrote:
>
> KVM: for the scope.
>
> On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > It returns the cumulative nanoseconds that the host has been suspended.
>
> Please write changelogs that are standalone.  "It returns ..." is completely
> nonsensical without the context of the shortlog.
>
> > It is intended to be used for reporting host suspend time to the guest.
> >
> > Change-Id: I8f644c9fbdb2c48d2c99dd9efaa5c85a83a14c2a
>
> Drop gerrit's metadata before posting.
>
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> >  include/linux/kvm_host.h |  2 ++
> >  virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
> >  2 files changed, 28 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 401439bb21e3e6..cf926168b30820 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2553,4 +2553,6 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                   struct kvm_pre_fault_memory *range);
> >  #endif
> >
> > +u64 kvm_total_suspend_ns(void);
> > +
> >  #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index de2c11dae23163..d5ae237df76d0d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -889,13 +889,39 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >
> >  #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
> >
> > +static u64 last_suspend;
> > +static u64 total_suspend_ns;
> > +
> > +u64 kvm_total_suspend_ns(void)
> > +{
> > +     return total_suspend_ns;
> > +}
> > +
> > +
>
> Extra whitespace.
>
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > +{
> > +     switch (state) {
> > +     case PM_HIBERNATION_PREPARE:
> > +     case PM_SUSPEND_PREPARE:
> > +             last_suspend = ktime_get_boottime_ns();
> > +     case PM_POST_HIBERNATION:
> > +     case PM_POST_SUSPEND:
> > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
>
> This is broken.  As should be quite clear from the function parameters, this is
> a per-VM notifier.  While clobbering "last_suspend" is relatively benign,
> accumulating into "total_suspend_ns" for every VM will cause the "total" suspend
> time to be wildly inaccurate if there are multiple VMs.

Good catch. Thanks for spotting that.

>
> > +     }
> > +
> > +     return NOTIFY_DONE;
> > +}
> > +
> >  static int kvm_pm_notifier_call(struct notifier_block *bl,
> >                               unsigned long state,
> >                               void *unused)
> >  {
> >       struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
> >
> > +     if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
> > +             return NOTIFY_BAD;
>
> This is ridiculous on multiple fronts.  There's zero reason for kvm_pm_notifier()
> to return a value, it always succeeds.  I also see zero reason to put this code
> in common KVM.  It's 100% an x86-only concept, and x86 is the only architecture
> that implements kvm_arch_pm_notifier().  So just shove the logic into x86's
> implementation.

The feedback I thought I got in v1 was that this could be useful for
other architectures too.
But given that the current series only implements it for x86, I guess
it's fair for it to be out of common KVM for now.
I will move it back into x86's implementation.

Thanks for the very quick reviews.
-- Suleiman
kernel test robot Jan. 8, 2025, 7:15 a.m. UTC | #3
Hi Suleiman,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvm/next linus/master v6.13-rc6 next-20250107]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Suleiman-Souhlal/kvm-Introduce-kvm_total_suspend_ns/20250107-122819
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20250107042202.2554063-2-suleiman%40google.com
patch subject: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
config: s390-randconfig-001-20250108 (https://download.01.org/0day-ci/archive/20250108/202501081504.AE4wDCL9-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250108/202501081504.AE4wDCL9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501081504.AE4wDCL9-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/s390/kvm/../../../virt/kvm/kvm_main.c:897:12: warning: 'last_suspend' defined but not used [-Wunused-variable]
     897 | static u64 last_suspend;
         |            ^~~~~~~~~~~~


vim +/last_suspend +897 arch/s390/kvm/../../../virt/kvm/kvm_main.c

   896	
 > 897	static u64 last_suspend;
   898	static u64 total_suspend_ns;
   899
kernel test robot Jan. 8, 2025, 8:36 a.m. UTC | #4
Hi Suleiman,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvm/next linus/master v6.13-rc6 next-20250107]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Suleiman-Souhlal/kvm-Introduce-kvm_total_suspend_ns/20250107-122819
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link:    https://lore.kernel.org/r/20250107042202.2554063-2-suleiman%40google.com
patch subject: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().
config: loongarch-randconfig-002-20250108 (https://download.01.org/0day-ci/archive/20250108/202501081616.nFwj9jxx-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250108/202501081616.nFwj9jxx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501081616.nFwj9jxx-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/loongarch/kvm/../../../virt/kvm/kvm_main.c:897:12: warning: 'last_suspend' defined but not used [-Wunused-variable]
     897 | static u64 last_suspend;
         |            ^~~~~~~~~~~~


vim +/last_suspend +897 arch/loongarch/kvm/../../../virt/kvm/kvm_main.c

   896	
 > 897	static u64 last_suspend;
   898	static u64 total_suspend_ns;
   899
Sean Christopherson Jan. 15, 2025, 9:49 p.m. UTC | #5
On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> It returns the cumulative nanoseconds that the host has been suspended.
> It is intended to be used for reporting host suspend time to the guest.

...

>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> +{
> +	switch (state) {
> +	case PM_HIBERNATION_PREPARE:
> +	case PM_SUSPEND_PREPARE:
> +		last_suspend = ktime_get_boottime_ns();
> +	case PM_POST_HIBERNATION:
> +	case PM_POST_SUSPEND:
> +		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;

After spending too much time poking around kvmclock and sched_clock code, I'm pretty
sure that accounting *all* suspend time to steal_time is wildly inaccurate for
most clocksources that will be used by KVM x86 guests.

KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
backwards due to suspend+resume.  I haven't dug super deep, buy I assume/hope the
majority of suspend time is handled by massaging guest TSC.

There's still a notable gap, as KVM's TSC adjustments likely won't account for
the lag between CPUs coming online and vCPU's being restarted, but I don't know
that having KVM account the suspend duration is the right way to solve that issue.
Suleiman Souhlal Jan. 17, 2025, 6:35 a.m. UTC | #6
On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > It returns the cumulative nanoseconds that the host has been suspended.
> > It is intended to be used for reporting host suspend time to the guest.
>
> ...
>
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > +{
> > +     switch (state) {
> > +     case PM_HIBERNATION_PREPARE:
> > +     case PM_SUSPEND_PREPARE:
> > +             last_suspend = ktime_get_boottime_ns();
> > +     case PM_POST_HIBERNATION:
> > +     case PM_POST_SUSPEND:
> > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
>
> After spending too much time poking around kvmclock and sched_clock code, I'm pretty
> sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> most clocksources that will be used by KVM x86 guests.
>
> KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> backwards due to suspend+resume.  I haven't dug super deep, but I assume/hope the
> majority of suspend time is handled by massaging guest TSC.
>
> There's still a notable gap, as KVM's TSC adjustments likely won't account for
> the lag between CPUs coming online and vCPU's being restarted, but I don't know
> that having KVM account the suspend duration is the right way to solve that issue.

(It is my understanding that steal time has no impact on clock sources.)
On our machines, the problem isn't that the TSC is going backwards. As
you say, kvmclock takes care of that.

The problem these patches are trying to solve is that the time keeps
advancing for the VM while the host is suspended.
On the host, this problem does not happen because timekeeping is
stopped by timekeeping_suspend().
The problem with time advancing is that the guest scheduler thinks the
task that was running when the host suspended was running for the
whole duration. This was especially bad when we still had RT
throttling (prior to v6.12): if an RT task was current, the scheduler
would then throttle all RT tasks for a very long time. With dlserver,
the problem is much less severe, but still exists.
There is a similar problem when the host CPU is overcommitted, where
the guest scheduler thinks the current task ran for the full duration,
even though the effective running time was much lower. This is exactly
what steal time solves, which is why I thought addressing the suspend
issue with steal time was an acceptable approach.

TSC adjustments would also be a way to address the issue, but we would
then need another mechanism to still advance the guest wall time
during the host suspension.
We have tried that approach in the past, and it was working pretty
well for us, but it did not seem popular with the rest of the
community: https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/T/

There is an additional gap with both approaches, which is that time
advances when the VM is blocked because the VMM hasn't run KVM_RUN.
Ideally steal time would also include time where the VM isn't being
scheduled because of the VMM (but maybe only if the blocking is due to
something outside of the guest's control), so that the guest scheduler
doesn't punish tasks for it. But that's a completely different
discussion, and I should probably not even be mentioning that here.

I hope that helps give some background on these patches.
-- Suleiman
Sean Christopherson Jan. 17, 2025, 4:52 p.m. UTC | #7
On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > It returns the cumulative nanoseconds that the host has been suspended.
> > > It is intended to be used for reporting host suspend time to the guest.
> >
> > ...
> >
> > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > +{
> > > +     switch (state) {
> > > +     case PM_HIBERNATION_PREPARE:
> > > +     case PM_SUSPEND_PREPARE:
> > > +             last_suspend = ktime_get_boottime_ns();
> > > +     case PM_POST_HIBERNATION:
> > > +     case PM_POST_SUSPEND:
> > > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> >
> > After spending too much time poking around kvmclock and sched_clock code, I'm pretty
> > sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> > most clocksources that will be used by KVM x86 guests.
> >
> > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> > backwards due to suspend+resume.  I haven't dug super deep, but I assume/hope the
> > majority of suspend time is handled by massaging guest TSC.
> >
> > There's still a notable gap, as KVM's TSC adjustments likely won't account for
> > the lag between CPUs coming online and vCPU's being restarted, but I don't know
> > that having KVM account the suspend duration is the right way to solve that issue.
> 
> (It is my understanding that steal time has no impact on clock sources.)
> On our machines, the problem isn't that the TSC is going backwards. As
> you say, kvmclock takes care of that.
> 
> The problem these patches are trying to solve is that the time keeps
> advancing for the VM while the host is suspended.

Right, the issue is that because KVM adjusts guest TSC if the host TSC does go
backwards, then the accounting will be all kinds of messed up.

  1. Initiate suspend at host TSC X, guest TSC X+Y.

  2. Save X into last_host_tsc via kvm_arch_vcpu_put():

	vcpu->arch.last_host_tsc = rdtsc();

  3. Resume after N hours, host TSC reset to 0 and starts counting.

  4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.

  5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that guest
     TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)".

		u64 delta_cyc = max_tsc - local_tsc;
		list_for_each_entry(kvm, &vm_list, vm_list) {
			kvm->arch.backwards_tsc_observed = true;
			kvm_for_each_vcpu(i, vcpu, kvm) {
				vcpu->arch.tsc_offset_adjustment += delta_cyc;
				vcpu->arch.last_host_tsc = local_tsc;
				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
			}

Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
advancing.

kvmclock on the other hand counts from *host* boot, and so guest time keeps
advancing if the guest is using kvmclock.

  #ifdef CONFIG_X86_64
  static s64 get_kvmclock_base_ns(void)
  {
	/* Count up from boot time, but with the frequency of the raw clock.  */
	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
  }
  #else
  static s64 get_kvmclock_base_ns(void)
  {
	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
	return ktime_get_boottime_ns();
  }
  #endif

In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
Or maybe it's the other way around and effectively freezing guest TSC is super
problematic and fundamentally flawed.

Regardless of which one is "broken", unconditionally accounting suspend time to
steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
but sched_clock=kvmclock.  In that scenario, guest time doesn't advance, but
guest scheduler time does.  Ugh.

That particular wart can be avoided by having the guest use TSC for sched_clock[*],
e.g. so that at least the behavior of time is consistent.

Hmm, if freezing guest time across suspend is indeed problematic, one thought
would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
both clocksource and sched_clock.

[*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
Suleiman Souhlal Jan. 21, 2025, 5:37 a.m. UTC | #8
On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jan 17, 2025, Suleiman Souhlal wrote:
> > On Thu, Jan 16, 2025 at 6:49 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Jan 07, 2025, Suleiman Souhlal wrote:
> > > > It returns the cumulative nanoseconds that the host has been suspended.
> > > > It is intended to be used for reporting host suspend time to the guest.
> > >
> > > ...
> > >
> > > >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> > > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
> > > > +{
> > > > +     switch (state) {
> > > > +     case PM_HIBERNATION_PREPARE:
> > > > +     case PM_SUSPEND_PREPARE:
> > > > +             last_suspend = ktime_get_boottime_ns();
> > > > +     case PM_POST_HIBERNATION:
> > > > +     case PM_POST_SUSPEND:
> > > > +             total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
> > >
> > > After spending too much time poking around kvmclock and sched_clock code, I'm pretty
> > > sure that accounting *all* suspend time to steal_time is wildly inaccurate for
> > > most clocksources that will be used by KVM x86 guests.
> > >
> > > KVM already adjusts TSC, and by extension kvmclock, to account for the TSC going
> > > backwards due to suspend+resume.  I haven't dug super deep, but I assume/hope the
> > > majority of suspend time is handled by massaging guest TSC.
> > >
> > > There's still a notable gap, as KVM's TSC adjustments likely won't account for
> > > the lag between CPUs coming online and vCPU's being restarted, but I don't know
> > > that having KVM account the suspend duration is the right way to solve that issue.
> >
> > (It is my understanding that steal time has no impact on clock sources.)
> > On our machines, the problem isn't that the TSC is going backwards. As
> > you say, kvmclock takes care of that.
> >
> > The problem these patches are trying to solve is that the time keeps
> > advancing for the VM while the host is suspended.
>
> Right, the issue is that because KVM adjusts guest TSC if the host TSC does go
> backwards, then the accounting will be all kinds of messed up.
>
>   1. Initiate suspend at host TSC X, guest TSC X+Y.
>
>   2. Save X into last_host_tsc via kvm_arch_vcpu_put():
>
>         vcpu->arch.last_host_tsc = rdtsc();
>
>   3. Resume after N hours, host TSC reset to 0 and starts counting.
>
>   4. kvm_arch_enable_virtualization_cpu() runs at new host time Z.
>
>   5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that guest
>      TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)".
>
>                 u64 delta_cyc = max_tsc - local_tsc;
>                 list_for_each_entry(kvm, &vm_list, vm_list) {
>                         kvm->arch.backwards_tsc_observed = true;
>                         kvm_for_each_vcpu(i, vcpu, kvm) {
>                                 vcpu->arch.tsc_offset_adjustment += delta_cyc;
>                                 vcpu->arch.last_host_tsc = local_tsc;
>                                 kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>                         }
>
> Thus, if the guest is using the TSC for sched_clock, guest time does NOT keep
> advancing.
>
> kvmclock on the other hand counts from *host* boot, and so guest time keeps
> advancing if the guest is using kvmclock.
>
>   #ifdef CONFIG_X86_64
>   static s64 get_kvmclock_base_ns(void)
>   {
>         /* Count up from boot time, but with the frequency of the raw clock.  */
>         return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
>   }
>   #else
>   static s64 get_kvmclock_base_ns(void)
>   {
>         /* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
>         return ktime_get_boottime_ns();
>   }
>   #endif
>
> In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> Or maybe it's the other way around and effectively freezing guest TSC is super
> problematic and fundamentally flawed.
>
> Regardless of which one is "broken", unconditionally accounting suspend time to
> steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> but sched_clock=kvmclock.  In that scenario, guest time doesn't advance, but
> guest scheduler time does.  Ugh.
>
> That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> e.g. so that at least the behavior of time is consistent.
>
> Hmm, if freezing guest time across suspend is indeed problematic, one thought
> would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> both clocksource and sched_clock.
>
> [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com

I see what you're saying. Thanks for explaining.

To complicate things further there are also different kinds of
suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
suspends don't stop the CPU, at least on our machines, so the host TSC
keeps ticking. On "deep" suspends, on the other hand, the TSC might go
backwards.

But I suppose if the guest uses kvmclock the behavior should be the
same in either case.

At least for our use case we would definitely want guest *wall* time
to keep advancing, so we would still want to use kvmclock.

Would accounting the suspend duration in steal time be acceptable if
it was conditional on the guest using kvmclock?
We would need a way for the host to be notified that the guest is
indeed using it, possibly by adding a new MSR to be written to in
kvm_cs_enable().

-- Suleiman
Sean Christopherson Jan. 21, 2025, 8:22 p.m. UTC | #9
On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
> > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > Or maybe it's the other way around and effectively freezing guest TSC is super
> > problematic and fundamentally flawed.
> >
> > Regardless of which one is "broken", unconditionally accounting suspend time to
> > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > but sched_clock=kvmclock.  In that scenario, guest time doesn't advance, but
> > guest scheduler time does.  Ugh.
> >
> > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > e.g. so that at least the behavior of time is consistent.
> >
> > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > both clocksource and sched_clock.
> >
> > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> 
> I see what you're saying. Thanks for explaining.
> 
> To complicate things further there are also different kinds of
> suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> suspends don't stop the CPU, at least on our machines, so the host TSC
> keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> backwards.

Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
with nonstop TSC.

> But I suppose if the guest uses kvmclock the behavior should be the
> same in either case.
> 
> At least for our use case we would definitely want guest *wall* time
> to keep advancing, so we would still want to use kvmclock.
> 
> Would accounting the suspend duration in steal time be acceptable if
> it was conditional on the guest using kvmclock?
> We would need a way for the host to be notified that the guest is
> indeed using it,

And not just using kvmclock, but specifically using it for sched_clock.  E.g. the
current behavior for most Linux guests on modern hardware is that they'll use TSC
for clocksource, but kvmclock for sched_clock and wall clock.

> possibly by adding a new MSR to be written to in
> kvm_cs_enable().

I don't think that's a good way forward.  I expect kvmclock to be largely
deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
at which point tying this to kvmclock puts us back at square one.

Given that s2idle and standby don't reset host TSC, I think the right way to
handle this conundrum is to address the flaw that's noted in the "backwards TSC"
logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.

	 * ......................................  Unfortunately, we can't
	 * bring the TSCs fully up to date with real time, as we aren't yet far
	 * enough into CPU bringup that we know how much real time has actually
	 * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
	 * variables that haven't been updated yet.

I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
KVM *needs* to effectively snapshot guest TSC when onlining CPUs.

If that wart is fixed, then both kvmclock and TSC will account host suspend time,
and KVM can safely account the suspend time into steal time regardless of which
clock(s) the guest is using.
Suleiman Souhlal Feb. 4, 2025, 7:58 a.m. UTC | #10
On Wed, Jan 22, 2025 at 5:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 21, 2025, Suleiman Souhlal wrote:
> > On Sat, Jan 18, 2025 at 1:52 AM Sean Christopherson <seanjc@google.com> wrote:
> > > In short, AFAICT the issues you are observing are mostly a problem with kvmclock.
> > > Or maybe it's the other way around and effectively freezing guest TSC is super
> > > problematic and fundamentally flawed.
> > >
> > > Regardless of which one is "broken", unconditionally accounting suspend time to
> > > steal_time will do the wrong thing when sched_clock=tsc.  To further muddy the
> > > waters, current Linux-as-a-guest on modern hardware will likely use clocksource=tsc,
> > > but sched_clock=kvmclock.  In that scenario, guest time doesn't advance, but
> > > guest scheduler time does.  Ugh.
> > >
> > > That particular wart can be avoided by having the guest use TSC for sched_clock[*],
> > > e.g. so that at least the behavior of time is consistent.
> > >
> > > Hmm, if freezing guest time across suspend is indeed problematic, one thought
> > > would be to put the onus on the VMM/user to not advertise a "nonstop TSC" if the
> > > host may be suspending.  The Linux-as-a-guest would prefer kvmclock over TSC for
> > > both clocksource and sched_clock.
> > >
> > > [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com
> >
> > I see what you're saying. Thanks for explaining.
> >
> > To complicate things further there are also different kinds of
> > suspends. From what I've seen "shallow" (and/or "suspend-to-idle")
> > suspends don't stop the CPU, at least on our machines, so the host TSC
> > keeps ticking. On "deep" suspends, on the other hand, the TSC might go
> > backwards.
>
> Yeah, only S3 and lower will power down the CPU.  All bets are off if the CPU
> doesn't have a nonstop TSC, but that's not at all unique to suspend, e.g. it's a
> problem if the CPU goes idle, and so I think it's safe to only worry about CPUs
> with nonstop TSC.
>
> > But I suppose if the guest uses kvmclock the behavior should be the
> > same in either case.
> >
> > At least for our use case we would definitely want guest *wall* time
> > to keep advancing, so we would still want to use kvmclock.
> >
> > Would accounting the suspend duration in steal time be acceptable if
> > it was conditional on the guest using kvmclock?
> > We would need a way for the host to be notified that the guest is
> > indeed using it,
>
> And not just using kvmclock, but specifically using it for sched_clock.  E.g. the
> current behavior for most Linux guests on modern hardware is that they'll use TSC
> for clocksource, but kvmclock for sched_clock and wall clock.
>
> > possibly by adding a new MSR to be written to in
> > kvm_cs_enable().
>
> I don't think that's a good way forward.  I expect kvmclock to be largely
> deprecated (guest side) in favor of raw TSC (with hardware-provided scale+offset),
> at which point tying this to kvmclock puts us back at square one.
>
> Given that s2idle and standby don't reset host TSC, I think the right way to
> handle this conundrum is to address the flaw that's noted in the "backwards TSC"
> logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.
>
>          * ......................................  Unfortunately, we can't
>          * bring the TSCs fully up to date with real time, as we aren't yet far
>          * enough into CPU bringup that we know how much real time has actually
>          * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
>          * variables that haven't been updated yet.
>
> I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
> suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
> PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
> KVM *needs* to effectively snapshot guest TSC when onlining CPUs.
>
> If that wart is fixed, then both kvmclock and TSC will account host suspend time,
> and KVM can safely account the suspend time into steal time regardless of which
> clock(s) the guest is using.

I tried your suggestion of moving this to a PM notifier and I found
that it's possible for VCPUs to run after resume but before the PM
notifier has been called, because the resume notifiers get called
after tasks are unfrozen. Unfortunately that means that if we were to
do that, guest TSCs could go backwards.

However, I think it should be possible to keep the existing backwards
guest TSC prevention code but also use a notifier that further adjusts
the guest TSCs to advance time on suspends where the TSC did go
backwards. This would make both s2idle and deep suspends behave the
same way.

-- Suleiman
Suleiman Souhlal Feb. 5, 2025, 5:55 a.m. UTC | #11
On Tue, Feb 4, 2025 at 4:58 PM Suleiman Souhlal <suleiman@google.com> wrote:
> [...]
>
> I tried your suggestion of moving this to a PM notifier and I found
> that it's possible for VCPUs to run after resume but before the PM
> notifier has been called, because the resume notifiers get called
> after tasks are unfrozen. Unfortunately that means that if we were to
> do that, guest TSCs could go backwards.
>
> However, I think it should be possible to keep the existing backwards
> guest TSC prevention code but also use a notifier that further adjusts
> the guest TSCs to advance time on suspends where the TSC did go
> backwards. This would make both s2idle and deep suspends behave the
> same way.

An alternative might be to block VCPUs from newly entering the guest
between the pre and post suspend notifiers.
Otherwise, some of the steal time accounting would have to be done in
kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on
the first VCPU run, in case that happens before the resume notifier
would have fired. But the comment there says we can't call
ktime_get_boottime_ns() there, so maybe that's not possible.

-- Suleiman
Sean Christopherson Feb. 6, 2025, 1:29 a.m. UTC | #12
On Wed, Feb 05, 2025, Suleiman Souhlal wrote:
> On Tue, Feb 4, 2025 at 4:58 PM Suleiman Souhlal <suleiman@google.com> wrote:
> > > [...]
> >
> > I tried your suggestion of moving this to a PM notifier and I found
> > that it's possible for VCPUs to run after resume but before the PM
> > notifier has been called, because the resume notifiers get called
> > after tasks are unfrozen. Unfortunately that means that if we were to
> > do that, guest TSCs could go backwards.

Ugh.  That explains why KVM hooks the CPU online path.

> > However, I think it should be possible to keep the existing backwards
> > guest TSC prevention code but also use a notifier that further adjusts
> > the guest TSCs to advance time on suspends where the TSC did go
> > backwards. This would make both s2idle and deep suspends behave the
> > same way.
> 
> An alternative might be to block VCPUs from newly entering the guest
> between the pre and post suspend notifiers.
> Otherwise, some of the steal time accounting would have to be done in
> kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on
> the first VCPU run, in case that happens before the resume notifier
> would have fired. But the comment there says we can't call
> ktime_get_boottime_ns() there, so maybe that's not possible.

I don't think the PM notifier approach is viable.  It's simply too late.  Without
a hook in CPU online, KVM can't even tell which VMs/vCPUs were running before the
suspend, i.e. which VMs need to be updated.

One idea would be to simply fast forward guest TSC to current time when the vCPU
is loaded after suspend+resume.  E.g. this hack appears to work.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcd0c12c308e..27b25f8ca413 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4824,7 +4824,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
        /* Apply any externally detected TSC adjustments (due to suspend) */
        if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
-               adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+               u64 kernel_ns, tsc_now;
+
+               if (kvm_get_time_and_clockread(&kernel_ns, &tsc_now)) {
+                       u64 l1_tsc = nsec_to_cycles(vcpu, vcpu->kvm->arch.kvmclock_offset + kernel_ns);
+                       u64 offset = kvm_compute_l1_tsc_offset(vcpu, l1_tsc);
+
+                       kvm_vcpu_write_tsc_offset(vcpu, offset);
+               } else {
+                       adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+               }
                vcpu->arch.tsc_offset_adjustment = 0;
                kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
        }
Suleiman Souhlal Feb. 13, 2025, 3:56 a.m. UTC | #13
On Thu, Feb 6, 2025 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 05, 2025, Suleiman Souhlal wrote:
> > [...]
>
> I don't think the PM notifier approach is viable.  It's simply too late.  Without
> a hook in CPU online, KVM can't even tell which VMs/vCPUs were running before the
> suspend, i.e. which VMs need to be updated.
>
> One idea would be to simply fast forward guest TSC to current time when the vCPU
> is loaded after suspend+resume.  E.g. this hack appears to work.

That's really interesting!
One possible concern with this approach is that now each VCPU gets a
slightly different offset.
It's possible to avoid that by making sure that only one VCPU computes
the offset and having the other ones reuse it.
I'll send a patch. Should it be "Suggested-by:" you or is there
something else I should do?

I also wasn't sure if we should do this when the user sets the offset
with KVM_VCPU_TSC_OFFSET. I guess we probably should.

-- Suleiman

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3e6..cf926168b30820 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2553,4 +2553,6 @@  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+u64 kvm_total_suspend_ns(void);
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae23163..d5ae237df76d0d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -889,13 +889,39 @@  static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
+static u64 last_suspend;
+static u64 total_suspend_ns;
+
+u64 kvm_total_suspend_ns(void)
+{
+	return total_suspend_ns;
+}
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
+static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
+{
+	switch (state) {
+	case PM_HIBERNATION_PREPARE:
+	case PM_SUSPEND_PREPARE:
+		last_suspend = ktime_get_boottime_ns();
+		break;
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		total_suspend_ns += ktime_get_boottime_ns() - last_suspend;
+	}
+
+	return NOTIFY_DONE;
+}
+
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
 				void *unused)
 {
 	struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
 
+	if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
+		return NOTIFY_BAD;
+
 	return kvm_arch_pm_notifier(kvm, state);
 }