[v2,0/4] implement vcpu preempted check
diff mbox

Message ID 20160707094215.GT30921@twins.programming.kicks-ass.net
State New
Headers show

Commit Message

Peter Zijlstra July 7, 2016, 9:42 a.m. UTC
On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> > VCPU has been scheduled out since the last time the guest reset the bit.
> >  The guest can use an xchg to test-and-clear it.  The bit can be
> > accessed at any time, independent of the version field.
> 
> If one vCPU is preempted, and guest check it several times before this
> vCPU is scheded in, then the first time we can get "vCPU is
> preempted", however, since the field is cleared, the second time we
> will get "vCPU is running".
> 
> Do you mean we should call record_steal_time() in both kvm_sched_in()
> and kvm_sched_out() to record this field? Btw, if we should keep both
> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
> simultaneous?

I suspect you want something like so; except this has holes in.

We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
after enabling it, this means that if we get preempted in between, the
vcpu is reported as running even though it very much is not.

Fixing that requires much larger surgery.

---
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Wanpeng Li July 7, 2016, 10:12 a.m. UTC | #1
2016-07-07 17:42 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
>> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
>> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
>> > VCPU has been scheduled out since the last time the guest reset the bit.
>> >  The guest can use an xchg to test-and-clear it.  The bit can be
>> > accessed at any time, independent of the version field.
>>
>> If one vCPU is preempted, and guest check it several times before this
>> vCPU is scheded in, then the first time we can get "vCPU is
>> preempted", however, since the field is cleared, the second time we
>> will get "vCPU is running".
>>
>> Do you mean we should call record_steal_time() in both kvm_sched_in()
>> and kvm_sched_out() to record this field? Btw, if we should keep both
>> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
>> simultaneous?
>
> I suspect you want something like so; except this has holes in.
>
> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
> after enabling it, this means that if we get preempted in between, the
> vcpu is reported as running even though it very much is not.

Paolo also point out this to me offline yesterday: "Please change
pad[12] to "__u32 preempted; __u32 pad[11];" too, and remember to
update Documentation/virtual/kvm/msr.txt!". Btw, do this in preemption
notifier means that the vCPU is real preempted on host, however,
depends on vmexit is different semantic I think.

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wanpeng Li July 7, 2016, 10:27 a.m. UTC | #2
2016-07-07 18:12 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-07-07 17:42 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
>> On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
>>> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
>>> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
>>> > VCPU has been scheduled out since the last time the guest reset the bit.
>>> >  The guest can use an xchg to test-and-clear it.  The bit can be
>>> > accessed at any time, independent of the version field.
>>>
>>> If one vCPU is preempted, and guest check it several times before this
>>> vCPU is scheded in, then the first time we can get "vCPU is
>>> preempted", however, since the field is cleared, the second time we
>>> will get "vCPU is running".
>>>
>>> Do you mean we should call record_steal_time() in both kvm_sched_in()
>>> and kvm_sched_out() to record this field? Btw, if we should keep both
>>> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
>>> simultaneous?
>>
>> I suspect you want something like so; except this has holes in.
>>
>> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
>> after enabling it, this means that if we get preempted in between, the
>> vcpu is reported as running even though it very much is not.
>
> Paolo also point out this to me offline yesterday: "Please change
> pad[12] to "__u32 preempted; __u32 pad[11];" too, and remember to
> update Documentation/virtual/kvm/msr.txt!". Btw, do this in preemption
> notifier means that the vCPU is real preempted on host, however,
> depends on vmexit is different semantic I think.

In addition, I see xen's vcpu_runstate_info::state is updated during
schedule, so I think I can do this similarly through kvm preemption
notifier. IIUC, xen hypervisor has VCPUOP_get_runstate_info hypercall
implemention, so the desired interface can be implemented if they add
hypercall callsite in domU. I can add hypercall to kvm similarly.

Paolo, thoughts?

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra July 7, 2016, 11:08 a.m. UTC | #3
On Thu, Jul 07, 2016 at 06:12:51PM +0800, Wanpeng Li wrote:

> Btw, do this in preemption
> notifier means that the vCPU is real preempted on host, however,
> depends on vmexit is different semantic I think.

Not sure; suppose the vcpu is about to reenter, eg, we're in
vcpu_enter_guest() but before the preempt_disable() and the thread gets
preempted. Are we then not preempted? The vcpu might still very much be
in running state but had to service an vmexit due to an MSR or IO port
or whatnot.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra July 7, 2016, 11:09 a.m. UTC | #4
On Thu, Jul 07, 2016 at 11:42:15AM +0200, Peter Zijlstra wrote:
> I suspect you want something like so; except this has holes in.
> 
> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
> after enabling it, this means that if we get preempted in between, the
> vcpu is reported as running even though it very much is not.
> 
> Fixing that requires much larger surgery.

Note that this same hole is already a 'problem' for steal time
accounting. The thread can accrue further delays (iow steal time) after
we've called record_steal_time(). These delays will go unaccounted.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra July 7, 2016, 11:15 a.m. UTC | #5
On Thu, Jul 07, 2016 at 06:27:26PM +0800, Wanpeng Li wrote:
> In addition, I see xen's vcpu_runstate_info::state is updated during
> schedule, so I think I can do this similarly through kvm preemption
> notifier. IIUC, xen hypervisor has VCPUOP_get_runstate_info hypercall
> implemention, so the desired interface can be implemented if they add
> hypercall callsite in domU. I can add hypercall to kvm similarly.

So I suspect Xen has the page its writing to pinned in memory; so that a
write to it is guaranteed to not fault.

Otherwise I cannot see this working.

That is part of the larger surgery required for KVM steal time to get
'fixed'. Currently that steal time stuff uses kvm_write_guest_cached()
which appears to be able to fault in guest pages.

Or I'm not reading this stuff right; which is entirely possible.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra July 7, 2016, 11:21 a.m. UTC | #6
On Thu, Jul 07, 2016 at 11:42:15AM +0200, Peter Zijlstra wrote:
> +static void update_steal_time_preempt(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_steal_time *st;
> +
> +	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> +		return;
> +
> +	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
> +		return;
> +
> +	st = &vcpu->arch.st.steal;
> +
> +	st->pad[KVM_ST_PAD_PREEMPT] = 1; /* we've stopped running */

So maybe have this be:

	... = kvm_vcpu_running();

That avoids marking the vcpu preempted when we do hlt and such.

> +	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +		st, sizeof(struct kvm_steal_time));
> +}
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2766723c951..117270df43b6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1997,8 +1997,29 @@  static void kvmclock_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.pv_time_enabled = false;
 }
 
+static void update_steal_time_preempt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_steal_time *st;
+
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+		return;
+
+	st = &vcpu->arch.st.steal;
+
+	st->pad[KVM_ST_PAD_PREEMPT] = 1; /* we've stopped running */
+
+	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		st, sizeof(struct kvm_steal_time));
+}
+
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
+	struct kvm_steal_time *st;
+
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
@@ -2006,29 +2027,34 @@  static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
-	if (vcpu->arch.st.steal.version & 1)
-		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
+	st = &vcpu->arch.st.steal;
+
+	if (st->version & 1) {
+		st->flags = KVM_ST_FLAG_PREEMPT;
+		st->version += 1;  /* first time write, random junk */
+	}
 
-	vcpu->arch.st.steal.version += 1;
+	st->version += 1;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 
 	smp_wmb();
 
-	vcpu->arch.st.steal.steal += current->sched_info.run_delay -
+	st->steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	st->pad[KVM_ST_PAD_PREEMPT] = 0; /* we're about to start running */
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 
 	smp_wmb();
 
-	vcpu->arch.st.steal.version += 1;
+	st->version += 1;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 }
 
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -6693,6 +6719,8 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	preempt_enable();
 
+	update_steal_time_preempt(vcpu);
+
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 
 	/*