
[v2] KVM: x86: Use fast path for Xen timer delivery

Message ID a3989e7ff9cca77f680f9bdfbaee52b707693221.camel@infradead.org (mailing list archive)
State New, archived

Commit Message

David Woodhouse Sept. 29, 2023, 11:36 a.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

Most of the time there's no need to kick the vCPU and deliver the timer
event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
directly from the timer callback, and only fall back to the slow path
when it's necessary to do so.

This gives a significant improvement in timer latency testing (using
nanosleep() for various periods and then measuring the actual time
elapsed).

However, there was a reason¹ the fast path was dropped when this support
was first added. The current code holds vcpu->mutex for all operations on
the kvm->arch.timer_expires field, and the fast path introduces potential
race conditions. So... ensure the hrtimer is *cancelled* before making
changes in kvm_xen_start_timer(), and also when reading the values out
for KVM_XEN_VCPU_ATTR_TYPE_TIMER.

Add some sanity checks to ensure the truth of the claim that all the
other code paths are run with the vcpu loaded. And use hrtimer_cancel()
directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
check in kvm_xen_stop_timer().

¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---

 • v2: Remember, and deal with, those races.

 arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 6 deletions(-)
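For readers outside the Xen code, the flow the patch gives xen_timer_callback() can be modelled as a standalone sketch. The stubbed types and model_* names below are invented purely for illustration; they only mirror the patch's logic (try the lockless event-channel delivery first, fall back to kicking the vCPU when the fast path cannot run):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the real vCPU fields touched by the callback. */
struct model_vcpu {
	int timer_pending;                 /* models arch.xen.timer_pending */
	unsigned long long timer_expires;  /* models arch.xen.timer_expires */
	bool kicked;                       /* models kvm_vcpu_kick() */
};

#define MODEL_EWOULDBLOCK (-11)

/* Models kvm_xen_set_evtchn_fast(): succeeds unless it cannot run
 * in this context (e.g. its cache is stale), in which case it asks
 * the caller to defer. */
static int model_set_evtchn_fast(bool fast_path_ok)
{
	return fast_path_ok ? 0 : MODEL_EWOULDBLOCK;
}

static void model_timer_callback(struct model_vcpu *v, bool fast_path_ok)
{
	if (v->timer_pending)
		return;                    /* already deferred; nothing to do */

	if (model_set_evtchn_fast(fast_path_ok) == MODEL_EWOULDBLOCK) {
		v->timer_pending++;        /* defer to slow path in vCPU context */
		v->kicked = true;
	} else {
		v->timer_expires = 0;      /* IRQ queued; timer no longer pending */
	}
}
```

Note how timer_expires is cleared only on successful fast-path delivery; on the deferred path it stays set until kvm_xen_inject_timer_irqs() runs, which is what the later snapshot logic in kvm_xen_vcpu_get_attr() relies on.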

Comments

Paul Durrant Sept. 29, 2023, 1:26 p.m. UTC | #1
On 29/09/2023 12:36, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.
> 
> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
> 
> However, there was a reason¹ the fast path was dropped when this support
> was first added. The current code holds vcpu->mutex for all operations on
> the kvm->arch.timer_expires field, and the fast path introduces potential
> race conditions. So... ensure the hrtimer is *cancelled* before making
> changes in kvm_xen_start_timer(), and also when reading the values out
> for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
> 
> Add some sanity checks to ensure the truth of the claim that all the
> other code paths are run with the vcpu loaded. And use hrtimer_cancel()
> directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> check in kvm_xen_stop_timer().
> 
> ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> 
>   • v2: Remember, and deal with, those races.
> 
>   arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 58 insertions(+), 6 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>
Sean Christopherson Sept. 29, 2023, 3:16 p.m. UTC | #2
On Fri, Sep 29, 2023, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Most of the time there's no need to kick the vCPU and deliver the timer
> event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> directly from the timer callback, and only fall back to the slow path
> when it's necessary to do so.

It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.

> This gives a significant improvement in timer latency testing (using
> nanosleep() for various periods and then measuring the actual time
> elapsed).
> 
> However, there was a reason¹ the fast path was dropped when this support

Heh, please use [1] or [*] like everyone else.  I can barely see that tiny little ¹.

> was first added. The current code holds vcpu->mutex for all operations on
> the kvm->arch.timer_expires field, and the fast path introduces potential
> race conditions. So... ensure the hrtimer is *cancelled* before making
> changes in kvm_xen_start_timer(), and also when reading the values out
> for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
> 
> Add some sanity checks to ensure the truth of the claim that all the
> other code paths are run with the vcpu loaded.  And use hrtimer_cancel()
> directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> check in kvm_xen_stop_timer().

This should really be at least 2 patches, probably 3:

  1. add the assertions and the destroy_vcpu() change
  2. cancel the timer before starting a new one or reading the value from userspace
  3. use the fastpath delivery from the timer callback

> ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> 
>  • v2: Remember, and deal with, those races.
> 
>  arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index fb1110b2385a..9d0d602a2466 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -117,6 +117,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>  
>  void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
>  {
> +	WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> +
>  	if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
>  		struct kvm_xen_evtchn e;
>  
> @@ -136,18 +138,41 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
>  {
>  	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
>  					     arch.xen.timer);
> +	struct kvm_xen_evtchn e;
> +	int rc;
> +
>  	if (atomic_read(&vcpu->arch.xen.timer_pending))
>  		return HRTIMER_NORESTART;
>  
> -	atomic_inc(&vcpu->arch.xen.timer_pending);
> -	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> -	kvm_vcpu_kick(vcpu);
> +	e.vcpu_id = vcpu->vcpu_id;
> +	e.vcpu_idx = vcpu->vcpu_idx;
> +	e.port = vcpu->arch.xen.timer_virq;
> +	e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> +
> +	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> +	if (rc == -EWOULDBLOCK) {
> +		atomic_inc(&vcpu->arch.xen.timer_pending);
> +		kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> +		kvm_vcpu_kick(vcpu);
> +	} else {
> +		vcpu->arch.xen.timer_expires = 0;

Eww.  IIUC, timer_expires isn't cleared so that the pending IRQ is captured by
kvm_xen_vcpu_get_attr(), i.e. because xen.timer_pending itself isn't migrated.

> +	}
>  
>  	return HRTIMER_NORESTART;
>  }
>  
>  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
>  {
> +	WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> +
> +	/*
> +	 * Avoid races with the old timer firing. Checking timer_expires
> +	 * to avoid calling hrtimer_cancel() will only have false positives
> +	 * so is fine.
> +	 */
> +	if (vcpu->arch.xen.timer_expires)
> +		hrtimer_cancel(&vcpu->arch.xen.timer);
> +
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
>  	vcpu->arch.xen.timer_expires = guest_abs;
>  
> @@ -163,6 +188,8 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
>  
>  static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
>  {
> +	WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> +
>  	hrtimer_cancel(&vcpu->arch.xen.timer);
>  	vcpu->arch.xen.timer_expires = 0;
>  	atomic_set(&vcpu->arch.xen.timer_pending, 0);
> @@ -1019,13 +1046,38 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  		r = 0;
>  		break;
>  
> -	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> +	case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
> +		bool pending = false;
> +
> +		/*
> +		 * Ensure a consistent snapshot of state is captures, with a

s/captures/captured

> +		 * timer either being pending, or fully delivered. Not still

Kind of a nit, but IMO "fully delivered" isn't accurate, at least not without
more clarification.  I would consider "fully delivered" to mean that the IRQ
has caused the guest to start executing its IRQ handler.  Maybe "fully queued in
the event channel"?

> +		 * lurking in the timer_pending flag for deferred delivery.
> +		 */
> +		if (vcpu->arch.xen.timer_expires) {
> +			pending = hrtimer_cancel(&vcpu->arch.xen.timer);
> +			kvm_xen_inject_timer_irqs(vcpu);

If the goal is to not have pending timers, then kvm_xen_inject_timer_irqs()
should be called unconditionally after canceling the hrtimer, no?

> +		}
> +
>  		data->u.timer.port = vcpu->arch.xen.timer_virq;
>  		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
>  		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> +
> +		/*
> +		 * The timer may be delivered immediately, while the returned
> +		 * state causes it to be set up and delivered again on the

Similar to the "fully delivered" stuff above, maybe s/timer/hrtimer to make it
a bit more clear that the host hrtimer can fire twice, but it won't ever result
in two timer IRQs being delivered from the guest's perspective.

> +		 * destination system after migration. That's fine, as the
> +		 * guest will not even have had a chance to run and process
> +		 * the interrupt by that point, so it won't even notice the
> +		 * duplicate IRQ.

Rather than say "so it won't even notice the duplicate IRQ", maybe be more explicit
and say "so the queued IRQ is guaranteed to be coalesced in the event channel
and/or guest local APIC".  Because I read the whole "delivered IRQ" stuff as there
really being two injected IRQs into the guest.

FWIW, this is all really gross, but I agree that even if the queued IRQ makes it
all the way to the guest's APIC, the IRQs will still be coalesced in the end.


> +		 */
> +		if (pending)
> +			hrtimer_start_expires(&vcpu->arch.xen.timer,
> +					      HRTIMER_MODE_ABS_HARD);
> +
>  		r = 0;
>  		break;
> -
> +	}
>  	case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
>  		data->u.vector = vcpu->arch.xen.upcall_vector;
>  		r = 0;
> @@ -2085,7 +2137,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
>  void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
>  {
>  	if (kvm_xen_timer_enabled(vcpu))

IIUC, this can more precisely be:

	if (vcpu->arch.xen.timer_expires)
		hrtimer_cancel(&vcpu->arch.xen.timer);

at which point it might make sense to add a small helper

  static void kvm_xen_cancel_timer(struct kvm_vcpu *vcpu)
  {
	if (vcpu->arch.xen.timer_expires)
		hrtimer_cancel(&vcpu->arch.xen.timer);
  }

to share code with 

> -		kvm_xen_stop_timer(vcpu);
> +		hrtimer_cancel(&vcpu->arch.xen.timer);
>  
>  	kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
>  	kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);
> -- 
> 2.40.1
> 
>
David Woodhouse Sept. 29, 2023, 9:25 p.m. UTC | #3
On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> On Fri, Sep 29, 2023, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > Most of the time there's no need to kick the vCPU and deliver the timer
> > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > directly from the timer callback, and only fall back to the slow path
> > when it's necessary to do so.
> 
> It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
> only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.

That's an implementation detail. Like all of the fast path functions
that can be called from kvm_arch_set_irq_inatomic(), it has its own
criteria for why it might return -EWOULDBLOCK or not. Those are *its*
business. And in fact one of Paul's current patches is tweaking them
subtly, but that isn't relevant here. (But yes, you are broadly correct
in your understanding.)

> > This gives a significant improvement in timer latency testing (using
> > nanosleep() for various periods and then measuring the actual time
> > elapsed).
> > 
> > However, there was a reason¹ the fast path was dropped when this support
> 
> Heh, please use [1] or [*] like everyone else.  I can barely see that tiny little ¹.

Isn't that the *point*? The reference to the footnote isn't supposed to
detract from the flow of the main text. It's exactly how you'll see it
when typeset properly. I've always assumed the people using [1] or [*]
just haven't yet realised that it's the 21st century and we are no
longer limited to 7-bit ASCII. Or haven't worked out how to type
anything but ASCII.

> > was first added. The current code holds vcpu->mutex for all operations on
> > the kvm->arch.timer_expires field, and the fast path introduces potential
> > race conditions. So... ensure the hrtimer is *cancelled* before making
> > changes in kvm_xen_start_timer(), and also when reading the values out
> > for KVM_XEN_VCPU_ATTR_TYPE_TIMER.
> > 
> > Add some sanity checks to ensure the truth of the claim that all the
> > other code paths are run with the vcpu loaded.  And use hrtimer_cancel()
> > directly from kvm_xen_destroy_vcpu() to avoid a false positive from the
> > check in kvm_xen_stop_timer().
> 
> This should really be at least 2 patches, probably 3:
> 
>   1. add the assertions and the destroy_vcpu() change
>   2. cancel the timer before starting a new one or reading the value from userspace
>   3. use the fastpath delivery from the timer callback

Hm, I think that's borderline. I pondered it and came to the opposite
conclusion. Cancelling the timer wasn't needed without the fastpath
delivery; it isn't a separate fix. You could consider it a preparatory
patch I suppose... but I didn't. It's not like adding the fast path in
itself is complex enough that the other parts need to be broken out. 

> > ¹ https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com/
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> > 
> >  • v2: Remember, and deal with, those races.
> > 
> >  arch/x86/kvm/xen.c | 64 +++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 58 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index fb1110b2385a..9d0d602a2466 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -117,6 +117,8 @@ static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
> >  
> >  void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
> >  {
> > +       WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> >         if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
> >                 struct kvm_xen_evtchn e;
> >  
> > @@ -136,18 +138,41 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
> >  {
> >         struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
> >                                              arch.xen.timer);
> > +       struct kvm_xen_evtchn e;
> > +       int rc;
> > +
> >         if (atomic_read(&vcpu->arch.xen.timer_pending))
> >                 return HRTIMER_NORESTART;
> >  
> > -       atomic_inc(&vcpu->arch.xen.timer_pending);
> > -       kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> > -       kvm_vcpu_kick(vcpu);
> > +       e.vcpu_id = vcpu->vcpu_id;
> > +       e.vcpu_idx = vcpu->vcpu_idx;
> > +       e.port = vcpu->arch.xen.timer_virq;
> > +       e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> > +
> > +       rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> > +       if (rc == -EWOULDBLOCK) {
> > +               atomic_inc(&vcpu->arch.xen.timer_pending);
> > +               kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> > +               kvm_vcpu_kick(vcpu);
> > +       } else {
> > +               vcpu->arch.xen.timer_expires = 0;
> 
> Eww.  IIUC, timer_expires isn't cleared so that the pending IRQ is captured by
> kvm_xen_vcpu_get_attr(), i.e. because xen.timer_pending itself isn't migrated.

-EPARSE. Er... yes?

The timer_expires field is non-zero when there's a pending timer. When
the timer interrupt has fired, the timer is no longer pending and the
timer_expires field is set to zero.

The xen.timer_pending field is indeed not migrated. We *flush* it in
kvm_xen_vcpu_get_attr(). It's now only used for the *deferral* to the
slow path.

So... yes, timer_expires *wasn't* previously cleared, because the timer
IRQ hadn't been delivered yet, and yes that was to avoid races with
kvm_xen_vcpu_get_attr(). Partly because the timer_pending field was
internal and not migrated. As far as userspace is concerned, the timer
has either fired, or it has not. Implementation details of the
timer_pending field — and the per-vcpu evtchn_pending_sel — are not
exposed.

> > +       }
> >  
> >         return HRTIMER_NORESTART;
> >  }
> >  
> >  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> >  {
> > +       WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> > +       /*
> > +        * Avoid races with the old timer firing. Checking timer_expires
> > +        * to avoid calling hrtimer_cancel() will only have false positives
> > +        * so is fine.
> > +        */
> > +       if (vcpu->arch.xen.timer_expires)
> > +               hrtimer_cancel(&vcpu->arch.xen.timer);
> > +
> >         atomic_set(&vcpu->arch.xen.timer_pending, 0);
> >         vcpu->arch.xen.timer_expires = guest_abs;
> >  
> > @@ -163,6 +188,8 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
> >  
> >  static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
> >  {
> > +       WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
> > +
> >         hrtimer_cancel(&vcpu->arch.xen.timer);
> >         vcpu->arch.xen.timer_expires = 0;
> >         atomic_set(&vcpu->arch.xen.timer_pending, 0);
> > @@ -1019,13 +1046,38 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> >                 r = 0;
> >                 break;
> >  
> > -       case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
> > +       case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
> > +               bool pending = false;
> > +
> > +               /*
> > +                * Ensure a consistent snapshot of state is captures, with a
> 
> s/captures/captured

Ack.

> > +                * timer either being pending, or fully delivered. Not still
> 
> Kind of a nit, but IMO "fully delivered" isn't accurate, at least not without
> more clarification.  I would consider "fully delivered" to mean that the IRQ
> has caused the guest to start executing its IRQ handler.  Maybe "fully queued in
> the event channel"?

OK, I'll reword.


> > +                * lurking in the timer_pending flag for deferred delivery.
> > +                */
> > +               if (vcpu->arch.xen.timer_expires) {
> > +                       pending = hrtimer_cancel(&vcpu->arch.xen.timer);
> > +                       kvm_xen_inject_timer_irqs(vcpu);
> 
> If the goal is to not have pending timers, then kvm_xen_inject_timer_irqs()
> should be called unconditionally after canceling the hrtimer, no?

It *is* called unconditionally after cancelling the hrtimer.

Or did you mean unconditionally whether we cancel the hrtimer or not?
The comment explains the logic for not needing that. If *either* the
timer is still active, *or* it's already fired and has taken the slow
path and set the timer_pending flag, then timer_expires won't have been
zeroed yet. So the race conditions inherent in doing this conditional
on vcpu->arch.xen.timer_expires are fine because there are only false
positives (which cause us to cancel a timer, and try to inject a
pending IRQ, which wasn't running and wasn't pending respectively).

Sounds like I need to expand that comment? 
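
Concretely, the false-positive-only property can be modelled like this (a standalone sketch with stubbed types; the model_* names are invented here and only mirror the snapshot logic in the KVM_XEN_VCPU_ATTR_TYPE_TIMER case):

```c
#include <assert.h>
#include <stdbool.h>

/* If expires is non-zero the timer is either still armed, or has fired
 * via the slow path and left 'pending' set. Handling both cases behind
 * the expires check can only ever no-op on a false positive. */
struct model_timer {
	bool hrtimer_armed;
	int pending;                       /* deferred slow-path delivery */
	unsigned long long expires;
	int irqs_injected;
};

/* Models hrtimer_cancel(): returns whether the timer was active. */
static bool model_hrtimer_cancel(struct model_timer *t)
{
	bool was_armed = t->hrtimer_armed;
	t->hrtimer_armed = false;
	return was_armed;
}

/* Models kvm_xen_inject_timer_irqs(): drains the deferred delivery. */
static void model_inject_pending(struct model_timer *t)
{
	if (t->pending > 0) {
		t->irqs_injected++;
		t->pending = 0;
		t->expires = 0;            /* delivery clears timer_expires */
	}
}

/* Returns the expires value reported to userspace, re-arming the
 * hrtimer afterwards if it had to be cancelled (hrtimer_start_expires). */
static unsigned long long model_get_attr_timer(struct model_timer *t)
{
	bool pending = false;
	unsigned long long snapshot;

	if (t->expires) {
		pending = model_hrtimer_cancel(t);
		model_inject_pending(t);
	}
	snapshot = t->expires;
	if (pending)
		t->hrtimer_armed = true;
	return snapshot;
}
```

In the "still armed" case the snapshot reports the future expiry and the timer is re-armed; in the "already fired, deferred" case the pending IRQ is flushed and zero is reported. Either way userspace sees a timer that is pending or fully queued, never one lurking in the internal pending flag.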

> > +               }
> > +
> >                 data->u.timer.port = vcpu->arch.xen.timer_virq;
> >                 data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
> >                 data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
> > +
> > +               /*
> > +                * The timer may be delivered immediately, while the returned
> > +                * state causes it to be set up and delivered again on the
> 
> Similar to the "fully delivered" stuff above, maybe s/timer/hrtimer to make it
> a bit more clear that the host hrtimer can fire twice, but it won't ever result
> in two timer IRQs being delivered from the guest's perspective.

Ack.

> > +                * destination system after migration. That's fine, as the
> > +                * guest will not even have had a chance to run and process
> > +                * the interrupt by that point, so it won't even notice the
> > +                * duplicate IRQ.
> 
> Rather than say "so it won't even notice the duplicate IRQ", maybe be more explicit
> and say "so the queued IRQ is guaranteed to be coalesced in the event channel
> and/or guest local APIC".  Because I read the whole "delivered IRQ" stuff as there
> really being two injected IRQs into the guest.
> 
> FWIW, this is all really gross, but I agree that even if the queued IRQ makes it
> all the way to the guest's APIC, the IRQs will still be coalesced in the end.
> 

As discussed before, we *could* have made fetching the timer attr also
*pause* the timer. It just seemed like extra complexity for no good
reason. The shared info page containing the event channel bitmap has to
be migrated after the vCPU state anyway.


> 
> > +                */
> > +               if (pending)
> > +                       hrtimer_start_expires(&vcpu->arch.xen.timer,
> > +                                             HRTIMER_MODE_ABS_HARD);
> > +
> >                 r = 0;
> >                 break;
> > -
> > +       }
> >         case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
> >                 data->u.vector = vcpu->arch.xen.upcall_vector;
> >                 r = 0;
> > @@ -2085,7 +2137,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
> >  void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
> >  {
> >         if (kvm_xen_timer_enabled(vcpu))
> 
> IIUC, this can more precisely be:
> 
>         if (vcpu->arch.xen.timer_expires)
>                 hrtimer_cancel(&vcpu->arch.xen.timer);
> 
> at which point it might make sense to add a small helper
> 
>   static void kvm_xen_cancel_timer(struct kvm_vcpu *vcpu)
>   {
>         if (vcpu->arch.xen.timer_expires)
>                 hrtimer_cancel(&vcpu->arch.xen.timer);
>   }
> 
> to share code with 
> 
> > -               kvm_xen_stop_timer(vcpu);
> > +               hrtimer_cancel(&vcpu->arch.xen.timer);
> >  
> >         kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
> >         kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);

In fact if I make the helper return a bool, I think I can use it
*three* times. At a cost of having a third function
kvm_xen_cancel_timer() alongside the existing kvm_xen_start_timer() and
kvm_xen_stop_timer(), and those *three* now make for a slightly
confusing set. I'll take a look and see how it makes me feel.
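
For the record, a bool-returning version of the suggested helper might look like the sketch below. This is illustrative only, with a stub struct standing in for the real struct kvm_vcpu and a modelled hrtimer_cancel(), so that it stands alone; the real helper would simply wrap hrtimer_cancel() on &vcpu->arch.xen.timer:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the relevant vcpu->arch.xen state. */
struct model_xen_vcpu {
	unsigned long long timer_expires;
	bool hrtimer_armed;
};

/* Models hrtimer_cancel(): returns true if the timer was active. */
static bool model_hrtimer_cancel(struct model_xen_vcpu *v)
{
	bool armed = v->hrtimer_armed;
	v->hrtimer_armed = false;
	return armed;
}

/* The suggested helper, returning whether a timer was actually
 * cancelled so that kvm_xen_start_timer(), the get_attr path and
 * kvm_xen_destroy_vcpu() could all share it. */
static bool model_xen_cancel_timer(struct model_xen_vcpu *v)
{
	if (v->timer_expires)
		return model_hrtimer_cancel(v);
	return false;
}
```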
Sean Christopherson Oct. 2, 2023, 5:41 p.m. UTC | #4
On Fri, Sep 29, 2023, David Woodhouse wrote:
> On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > 
> > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > directly from the timer callback, and only fall back to the slow path
> > > when it's necessary to do so.
> > 
> > It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
> > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> 
> That's an implementation detail.

And?  The target audience of changelogs are almost always people that care about
the implementation.

> Like all of the fast path functions that can be called from
> kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> -EWOULDBLOCK or not. Those are *its* business.

And all of the KVM code is the business of the people who contribute to the kernel,
now and in the future.  Yeah, there's a small chance that a detailed changelog can
become stale if the patch races with some other in-flight change, but even *that*
is a useful data point.  E.g. if Paul's patches somehow broke/degraded this code,
then knowing that what the author (you) intended/observed didn't match reality when
the patch was applied would be extremely useful information for whoever encountered
the hypothetical breakage.

> And in fact one of Paul's current patches is tweaking them subtly, but that
> isn't relevant here. (But yes, you are broadly correct in your
> understanding.)
> 
> > > This gives a significant improvement in timer latency testing (using
> > > nanosleep() for various periods and then measuring the actual time
> > > elapsed).
> > > 
> > > However, there was a reason¹ the fast path was dropped when this support
> > 
> > Heh, please use [1] or [*] like everyone else.  I can barely see that tiny little ¹.
> 
> Isn't that the *point*? The reference to the footnote isn't supposed to
> detract from the flow of the main text. It's exactly how you'll see it
> when typeset properly.
 
Footnotes that are "typeset properly" have the entire footnote in a different
font+size.  A tiny number next to normal sized text just looks weird to me.

And I often do a "reverse lookup" when I get to footnotes that are links, e.g. to
gauge whether or not it's worth my time to follow the link.  Trying to find the
tiny ¹ via a quick visual scan is an exercise in frustration, at least for the
monospace font I use for reading mail, e.g. it's much more readable on my end in
an editor using a different font.

Which is a big benefit to sticking to the old and kludgly ASCII: it provides a
fairly consistent experience regardless of what client/font/etc each reader is
using.  I'm not completely against using unicode characters, e.g. for names with
characters not found in the Latin alphabet, but for code and things like this,
IMO simpler is better.

> I've always assumed the people using [1] or [*] just haven't yet realised
> that it's the 21st century and we are no longer limited to 7-bit ASCII. Or
> haven't worked out how to type anything but ASCII.

Please don't devolve into ad hominem attacks against other reviewers and contributors.
If you want to argue that using footnote notation unicode is superior in some way,
then by all means, present your arguments.
David Woodhouse Oct. 2, 2023, 6:09 p.m. UTC | #5
On Mon, 2023-10-02 at 10:41 -0700, Sean Christopherson wrote:
> On Fri, Sep 29, 2023, David Woodhouse wrote:
> > On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > > directly from the timer callback, and only fall back to the slow path
> > > > when it's necessary to do so.
> > > 
> > > It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
> > > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> > 
> > That's an implementation detail.
> 
> And?  The target audience of changelogs are almost always people that care about
> the implementation.
>
> > Like all of the fast path functions that can be called from
> > kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> > -EWOULDBLOCK or not. Those are *its* business.
> 
> And all of the KVM code is the business of the people who contribute to the kernel,
> now and in the future.  Yeah, there's a small chance that a detailed changelog can
> become stale if the patch races with some other in-flight change, but even *that*
> is a useful data point.  E.g. if Paul's patches somehow broke/degraded this code,
> then knowing that what the author (you) intended/observed didn't match reality when
> the patch was applied would be extremely useful information for whoever encountered
> the hypothetical breakage.

Fair enough, but on this occasion it truly doesn't matter. It has
nothing to do with the implementation of *this* patch. This code makes
no assumptions and has no dependency on *when* that fast path might
return -EWOULDBLOCK. Sometimes it does, sometimes it doesn't. This code
just doesn't care one iota.

If this code had *dependencies* on the precise behaviour of
kvm_xen_set_evtchn_fast() that we needed to reason about, then sure,
I'd have written those explicitly into the commit comment *and* tried
to find some way of enforcing them with runtime warnings etc.

But it doesn't. So I am no more inclined to document the precise
behaviour of kvm_xen_set_evtchn_fast() in a patch which just happens to
call it, than I am inclined to document hrtimer_cancel() or any other
function called from the new code :)

> > And in fact one of Paul's current patches is tweaking them subtly, but that
> > isn't relevant here. (But yes, you are broadly correct in your
> > understanding.)
> > 
> > > > This gives a significant improvement in timer latency testing (using
> > > > nanosleep() for various periods and then measuring the actual time
> > > > elapsed).
> > > > 
> > > > However, there was a reason¹ the fast path was dropped when this support
> > > 
> > > Heh, please use [1] or [*] like everyone else.  I can barely see that tiny little ¹.
> > 
> > Isn't that the *point*? The reference to the footnote isn't supposed to
> > detract from the flow of the main text. It's exactly how you'll see it
> > when typeset properly.
>  
> Footnotes that are "typeset properly" have the entire footnote in a different
> font+size.  A tiny number next to normal sized text just looks weird to me.
> 
> And I often do a "reverse lookup" when I get to footnotes that are links, e.g. to
> gauge whether or not it's worth my time to follow the link.  Trying to find the
> tiny ¹ via a quick visual scan is an exercise in frustration, at least for the
> monospace font I use for reading mail, e.g. it's much more readable on my end in
> an editor using a different font.
> 
> Which is a big benefit to sticking to the old and kludgly ASCII: it provides a
> fairly consistent experience regardless of what client/font/etc each reader is
> using.  I'm not completely against using unicode characters, e.g. for names with
> characters not found in the Latin alphabet, but for code and things like this,
> IMO simpler is better.
> 
> > I've always assumed the people using [1] or [*] just haven't yet realised
> > that it's the 21st century and we are no longer limited to 7-bit ASCII. Or
> > haven't worked out how to type anything but ASCII.
> 
> Please don't devolve into ad hominem attacks against other reviewers and contributors.
> If you want to argue that using footnote notation unicode is superior in some way,
> then by all means, present your arguments.

Hey, you started the logical fallacies with the ad populum when you
said "everyone else" :)

Not that that was true; there are examples of ¹ being used in the
kernel changelog going back decades.

I can just drop the footnote; there's not a huge amount of benefit in
referring back to the old mail thread anyway. The gist of it was
covered in the commit comment.
Sean Christopherson Oct. 2, 2023, 6:45 p.m. UTC | #6
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 10:41 -0700, Sean Christopherson wrote:
> > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > > 
> > > > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > > > directly from the timer callback, and only fall back to the slow path
> > > > > when it's necessary to do so.
> > > > 
> > > > It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
> > > > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> > > 
> > > That's an implementation detail.
> > 
> > And?  The target audience of changelogs are almost always people that care about
> > the implementation.
> >
> > > Like all of the fast path functions that can be called from
> > > kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> > > -EWOULDBLOCK or not. Those are *its* business.
> > 
> > And all of the KVM code is the business of the people who contribute to the kernel,
> > now and in the future.  Yeah, there's a small chance that a detailed changelog can
> > become stale if the patch races with some other in-flight change, but even *that*
> > is a useful data point.  E.g. if Paul's patches somehow broke/degraded this code,
> > then knowing that what the author (you) intended/observed didn't match reality when
> > the patch was applied would be extremely useful information for whoever encountered
> > the hypothetical breakage.
> 
> Fair enough, but on this occasion it truly doesn't matter. It has
> nothing to do with the implementation of *this* patch. This code makes
> no assumptions and has no dependency on *when* that fast path might
> return -EWOULDBLOCK. Sometimes it does, sometimes it doesn't. This code
> just doesn't care one iota.
> 
> If this code had *dependencies* on the precise behaviour of
> kvm_xen_set_evtchn_fast() that we needed to reason about, then sure,
> I'd have written those explicitly into the commit comment *and* tried
> to find some way of enforcing them with runtime warnings etc.
> 
> But it doesn't. So I am no more inclined to document the precise
> behaviour of kvm_xen_set_evtchn_fast() in a patch which just happens to
> call it, than I am inclined to document hrtimer_cancel() or any other
> function called from the new code :)

Just because some bit of code doesn't care/differentiate doesn't mean the behavior
of said code is correct.  I agree that adding a comment to explain the gory details
is unnecessary and would lead to stale code.  But changelogs essentially capture a
single point in time, and a big role of the changelog is to help reviewers and
readers understand (a) the *intent* of the change and (b) whether or not that change
is correct.

E.g. there's an assumption that -EWOULDBLOCK is the only non-zero return code where
the correct response is to go down the slow path.

I'm not asking to spell out every single condition, I'm just asking for clarification
on what the intended behavior is, e.g.

  Use kvm_xen_set_evtchn_fast() directly from the timer callback, and fall
  back to the slow path if the event is valid but fast delivery isn't
  possible, which currently can only happen if delivery needs to block,
  e.g. because the gfn=>pfn cache is invalid or stale.

instead of simply saying "when it's necessary to do so" and leaving it up to the
reader to figure out what _they_ think that means, which might not always align with
what the author actually meant.

> > > And in fact one of Paul's current patches is tweaking them subtly, but that
> > > isn't relevant here. (But yes, you are broadly correct in your
> > > understanding.)
> > > 
> > > > > This gives a significant improvement in timer latency testing (using
> > > > > nanosleep() for various periods and then measuring the actual time
> > > > > elapsed).
> > > > > 
> > > > > However, there was a reason¹ the fast path was dropped when this support
> > > > 
> > > > Heh, please use [1] or [*] like everyone else.  I can barely see that tiny little ¹.
> > > 
> > > Isn't that the *point*? The reference to the footnote isn't supposed to
> > > detract from the flow of the main text. It's exactly how you'll see it
> > > when typeset properly.
> >  
> > Footnotes that are "typeset properly" have the entire footnote in a different
> > font+size.  A tiny number next to normal sized text just looks weird to me.
> > 
> > And I often do a "reverse lookup" when I get to footnotes that are links, e.g. to
> > gauge whether or not it's worth my time to follow the link.  Trying to find the
> > tiny ¹ via a quick visual scan is an exercise in frustration, at least for the
> > monospace font I use for reading mail, e.g. it's much more readable on my end in
> > an editor using a different font.
> > 
> > Which is a big benefit to sticking to the old and kludgy ASCII: it provides a
> > fairly consistent experience regardless of what client/font/etc each reader is
> > using.  I'm not completely against using unicode characters, e.g. for names with
> > characters not found in the Latin alphabet, but for code and things like this,
> > IMO simpler is better.
> > 
> > > I've always assumed the people using [1] or [*] just haven't yet realised
> > > that it's the 21st century and we are no longer limited to 7-bit ASCII. Or
> > > haven't worked out how to type anything but ASCII.
> > 
> > Please don't devolve into ad hominem attacks against other reviewers and contributors.
> > If you want to argue that using footnote notation unicode is superior in some way,
> > then by all means, present your arguments.
> 
> Hey, you started the logical fallacies with the ad populum when you
> said "everyone else" :)
>
> Not that that was true; there are examples of ¹ being used in the
> kernel changelog going back decades.

LOL, fine, "almost everyone else".
David Woodhouse Oct. 2, 2023, 7:33 p.m. UTC | #7
On Mon, 2023-10-02 at 11:45 -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 10:41 -0700, Sean Christopherson wrote:
> > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > On Fri, 2023-09-29 at 08:16 -0700, Sean Christopherson wrote:
> > > > > On Fri, Sep 29, 2023, David Woodhouse wrote:
> > > > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > > > 
> > > > > > Most of the time there's no need to kick the vCPU and deliver the timer
> > > > > > event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
> > > > > > directly from the timer callback, and only fall back to the slow path
> > > > > > when it's necessary to do so.
> > > > > 
> > > > > It'd be helpful for non-Xen folks to explain "when it's necessary".  IIUC, the
> > > > > only time it's necessary is if the gfn=>pfn cache isn't valid/fresh.
> > > > 
> > > > That's an implementation detail.
> > > 
> > > And?  The target audience of changelogs are almost always people that care about
> > > the implementation.
> > > 
> > > > Like all of the fast path functions that can be called from
> > > > kvm_arch_set_irq_inatomic(), it has its own criteria for why it might return
> > > > -EWOULDBLOCK or not. Those are *its* business.
> > > 
> > > And all of the KVM code is the business of the people who contribute to the kernel,
> > > now and in the future.  Yeah, there's a small chance that a detailed changelog can
> > > become stale if the patch races with some other in-flight change, but even *that*
> > > is a useful data point.  E.g. if Paul's patches somehow broke/degraded this code,
> > > then knowing that what the author (you) intended/observed didn't match reality when
> > > the patch was applied would be extremely useful information for whoever encountered
> > > the hypothetical breakage.
> > 
> > Fair enough, but on this occasion it truly doesn't matter. It has
> > nothing to do with the implementation of *this* patch. This code makes
> > no assumptions and has no dependency on *when* that fast path might
> > return -EWOULDBLOCK. Sometimes it does, sometimes it doesn't. This code
> > just doesn't care one iota.
> > 
> > If this code had *dependencies* on the precise behaviour of
> > kvm_xen_set_evtchn_fast() that we needed to reason about, then sure,
> > I'd have written those explicitly into the commit comment *and* tried
> > to find some way of enforcing them with runtime warnings etc.
> > 
> > But it doesn't. So I am no more inclined to document the precise
> > behaviour of kvm_xen_set_evtchn_fast() in a patch which just happens to
> > call it, than I am inclined to document hrtimer_cancel() or any other
> > function called from the new code :)
> 
> Just because some bit of code doesn't care/differentiate doesn't mean the behavior
> of said code is correct.  I agree that adding a comment to explain the gory details
> is unnecessary and would lead to stale code.  But changelogs essentially capture a
> single point in time, and a big role of the changelog is to help reviewers and
> readers understand (a) the *intent* of the change and (b) whether or not that change
> is correct.
> 
> E.g. there's an assumption that -EWOULDBLOCK is the only non-zero return code where
> the correct response is to go down the slow path.
> 
> I'm not asking to spell out every single condition, I'm just asking for clarification
> on what the intended behavior is, e.g.
> 
>   Use kvm_xen_set_evtchn_fast() directly from the timer callback, and fall
>   back to the slow path if the event is valid but fast delivery isn't
>   possible, which currently can only happen if delivery needs to block,
>   e.g. because the gfn=>pfn cache is invalid or stale.
> 
> instead of simply saying "when it's necessary to do so" and leaving it up to the
> reader to figure out what _they_ think that means, which might not always align with
> what the author actually meant.


Fair enough. There's certainly scope for something along the lines of


+	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
+	if (rc != -EWOULDBLOCK) {

   /*
    * If kvm_xen_set_evtchn_fast() returned -EWOULDBLOCK, then set the
    * timer_pending flag and kick the vCPU, to defer delivery of the 
    * event channel to a context which can sleep. If it fails for any
    * other reasons, just let it fail silently. The slow path fails 
    * silently too; a warning in that case may be guest triggerable,
    * should never happen anyway, and guests are generally going to
    * *notice* timers going missing.
    */

+		vcpu->arch.xen.timer_expires = 0;
+		return HRTIMER_NORESTART;
+	}

That's documenting *this* code, not the function it happens to call.
It's more verbose than I would normally have bothered to be, but I'm
all for improving the level of commenting in our code as long as it's
adding value.
Sean Christopherson Oct. 3, 2023, 4:12 p.m. UTC | #8
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 11:45 -0700, Sean Christopherson wrote:
> > E.g. there's an assumption that -EWOULDBLOCK is the only non-zero return code where
> > the correct response is to go down the slow path.
> > 
> > I'm not asking to spell out every single condition, I'm just asking for clarification
> > on what the intended behavior is, e.g.
> > 
> >   Use kvm_xen_set_evtchn_fast() directly from the timer callback, and fall
> >   back to the slow path if the event is valid but fast delivery isn't
> >   possible, which currently can only happen if delivery needs to block,
> >   e.g. because the gfn=>pfn cache is invalid or stale.
> > 
> > instead of simply saying "when it's necessary to do so" and leaving it up to the
> > reader to figure what _they_ think that means, which might not always align with
> > what the author actually meant.
> 
> 
> Fair enough. There's certainly scope for something along the lines of
> 
> 
> +	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
> +	if (rc != -EWOULDBLOCK) {
> 
>    /*
>     * If kvm_xen_set_evtchn_fast() returned -EWOULDBLOCK, then set the
>     * timer_pending flag and kick the vCPU, to defer delivery of the 
>     * event channel to a context which can sleep. If it fails for any
>     * other reasons, just let it fail silently. The slow path fails 
>     * silently too; a warning in that case may be guest triggerable,
>     * should never happen anyway, and guests are generally going to
>     * *notice* timers going missing.
>     */
> 
> +		vcpu->arch.xen.timer_expires = 0;
> +		return HRTIMER_NORESTART;
> +	}
> 
> That's documenting *this* code, not the function it happens to call.
> It's more verbose than I would normally have bothered to be, but I'm
> all for improving the level of commenting in our code as long as it's
> adding value. 

I'm completely ok with no comment, I just want something in the changelog.  I'm
also not opposed to a comment, but I don't think it's necessary.

I don't have a problem with digging around code to understand the subtleties, or
even the high level "what" in many cases.  What I don't like is encountering code
where *nothing* explains the author's intent.  All too often I've encountered
historical code in KVM where it's not at all obvious if code does what the author
intended, e.g. if a bug was a simple goof or a completely misguided design choice.

Holler if you plan on sending a v4 with the comment.  I'm a-ok applying v3 with a
massaged changelog to fold in the gist of the comment.
David Woodhouse Oct. 3, 2023, 5:18 p.m. UTC | #9
On Tue, 2023-10-03 at 09:12 -0700, Sean Christopherson wrote:
> 
> Holler if you plan on sending a v4 with the comment.  I'm a-ok applying v3 with a
> massaged changelog to fold in the gist of the comment.

Yes please. I think you have a far better understanding of precisely
what it is that you're looking for in said comment.

I'm all for good comments, but on this occasion it seems I can't see
the wood for the trees, and I can't quite work out what it is that's
non-obvious.

Thanks!
Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index fb1110b2385a..9d0d602a2466 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -117,6 +117,8 @@  static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
 
 void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
 {
+	WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
+
 	if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
 		struct kvm_xen_evtchn e;
 
@@ -136,18 +138,41 @@  static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 {
 	struct kvm_vcpu *vcpu = container_of(timer, struct kvm_vcpu,
 					     arch.xen.timer);
+	struct kvm_xen_evtchn e;
+	int rc;
+
 	if (atomic_read(&vcpu->arch.xen.timer_pending))
 		return HRTIMER_NORESTART;
 
-	atomic_inc(&vcpu->arch.xen.timer_pending);
-	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
-	kvm_vcpu_kick(vcpu);
+	e.vcpu_id = vcpu->vcpu_id;
+	e.vcpu_idx = vcpu->vcpu_idx;
+	e.port = vcpu->arch.xen.timer_virq;
+	e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+	rc = kvm_xen_set_evtchn_fast(&e, vcpu->kvm);
+	if (rc == -EWOULDBLOCK) {
+		atomic_inc(&vcpu->arch.xen.timer_pending);
+		kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+		kvm_vcpu_kick(vcpu);
+	} else {
+		vcpu->arch.xen.timer_expires = 0;
+	}
 
 	return HRTIMER_NORESTART;
 }
 
 static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
 {
+	WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
+
+	/*
+	 * Avoid races with the old timer firing. Checking timer_expires
+	 * to avoid calling hrtimer_cancel() will only have false positives
+	 * so is fine.
+	 */
+	if (vcpu->arch.xen.timer_expires)
+		hrtimer_cancel(&vcpu->arch.xen.timer);
+
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
@@ -163,6 +188,8 @@  static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
 
 static void kvm_xen_stop_timer(struct kvm_vcpu *vcpu)
 {
+	WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());
+
 	hrtimer_cancel(&vcpu->arch.xen.timer);
 	vcpu->arch.xen.timer_expires = 0;
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
@@ -1019,13 +1046,38 @@  int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		r = 0;
 		break;
 
-	case KVM_XEN_VCPU_ATTR_TYPE_TIMER:
+	case KVM_XEN_VCPU_ATTR_TYPE_TIMER: {
+		bool pending = false;
+
+		/*
+		 * Ensure a consistent snapshot of state is captured, with the
+		 * timer either pending or fully delivered, not still lurking
+		 * in the timer_pending flag for deferred delivery.
+		 */
+		if (vcpu->arch.xen.timer_expires) {
+			pending = hrtimer_cancel(&vcpu->arch.xen.timer);
+			kvm_xen_inject_timer_irqs(vcpu);
+		}
+
 		data->u.timer.port = vcpu->arch.xen.timer_virq;
 		data->u.timer.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
 		data->u.timer.expires_ns = vcpu->arch.xen.timer_expires;
+
+		/*
+		 * The timer may be delivered immediately, while the returned
+		 * state causes it to be set up and delivered again on the
+		 * destination system after migration. That's fine, as the
+		 * guest will not even have had a chance to run and process
+		 * the interrupt by that point, so it won't even notice the
+		 * duplicate IRQ.
+		 */
+		if (pending)
+			hrtimer_start_expires(&vcpu->arch.xen.timer,
+					      HRTIMER_MODE_ABS_HARD);
+
 		r = 0;
 		break;
-
+	}
 	case KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR:
 		data->u.vector = vcpu->arch.xen.upcall_vector;
 		r = 0;
@@ -2085,7 +2137,7 @@  void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
 void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
 {
 	if (kvm_xen_timer_enabled(vcpu))
-		kvm_xen_stop_timer(vcpu);
+		hrtimer_cancel(&vcpu->arch.xen.timer);
 
 	kvm_gpc_deactivate(&vcpu->arch.xen.runstate_cache);
 	kvm_gpc_deactivate(&vcpu->arch.xen.runstate2_cache);