[3/3] kvm/powerpc: report guest steal time in host
diff mbox

Message ID c0db7eafdd13fcec26a0398e2ad0b7225c3998e9.1430913088.git.naveen.n.rao@linux.vnet.ibm.com
State New
Headers show

Commit Message

Naveen N. Rao May 6, 2015, 11:56 a.m. UTC
On powerpc, kvm tracks both the guest steal time as well as the time
when guest was idle and this gets sent in to the guest through DTL. The
guest accounts these entries as either steal time or idle time based on
the last running task. Since the true guest idle status is not visible
to the host, we can't accurately expose the guest steal time in the
host.

However, tracking the guest vcpu cede status can get us a reasonable
(within 5% variation) vcpu steal time since guest vcpus cede the
processor on entering the idle task. To do this, we introduce a new
field ceded_st in kvm_vcpu_arch structure to accurately track the guest
vcpu cede status (this is needed since the existing ceded field is
modified before we can use it). During DTL entry creation, we check this
flag and account the time as stolen if the guest vcpu had not ceded.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
Tests show that the steal time being reported in the host with this approach is
around 5% higher than the steal time shown in guest. I'd be interested to know
if there are ways to achieve better accounting of the guest steal time in host.

Thanks!
- Naveen

 arch/powerpc/include/asm/kvm_host.h     | 1 +
 arch/powerpc/kernel/asm-offsets.c       | 1 +
 arch/powerpc/kvm/book3s_hv.c            | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 3 +++
 4 files changed, 7 insertions(+)

Comments

Christian Borntraeger May 6, 2015, 12:46 p.m. UTC | #1
Am 06.05.2015 um 13:56 schrieb Naveen N. Rao:
> On powerpc, kvm tracks both the guest steal time as well as the time
> when guest was idle and this gets sent in to the guest through DTL. The
> guest accounts these entries as either steal time or idle time based on
> the last running task. Since the true guest idle status is not visible
> to the host, we can't accurately expose the guest steal time in the
> host.
> 
> However, tracking the guest vcpu cede status can get us a reasonable
> (within 5% variation) vcpu steal time since guest vcpus cede the
> processor on entering the idle task. To do this, we introduce a new
> field ceded_st in kvm_vcpu_arch structure to accurately track the guest
> vcpu cede status (this is needed since the existing ceded field is
> modified before we can use it). During DTL entry creation, we check this
> flag and account the time as stolen if the guest vcpu had not ceded.

I think this is more or less a question about the semantic:

What would happen if you use  current->sched_info.run_delay like x86 also
on power? How far are the numbers away? My feeling is, that the semantics
of "steal time" inside the guest is somewhat different on each platform. 

This brings me to a 2nd question:
Do you need to match the host view of guest steal time with the guest view
or do we want to have a host view that translates as "this is the time that
the guest was runnable but we were too busy to schedule him"?
For the former x86 has the best solution, as the host tells the guest its
understanding of steal - so both match. For the latter we actually try to
give guest steal a meaning in the host context  - the overload.
Would /proc/<pid>/schedstat value 2 (time spent waiting on a runqueue)
meet your requirements from the cover-letter?

Christian


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Naveen N. Rao May 6, 2015, 4:42 p.m. UTC | #2
On 2015/05/06 02:46PM, Christian Borntraeger wrote:
> Am 06.05.2015 um 13:56 schrieb Naveen N. Rao:
> > On powerpc, kvm tracks both the guest steal time as well as the time
> > when guest was idle and this gets sent in to the guest through DTL. The
> > guest accounts these entries as either steal time or idle time based on
> > the last running task. Since the true guest idle status is not visible
> > to the host, we can't accurately expose the guest steal time in the
> > host.
> > 
> > However, tracking the guest vcpu cede status can get us a reasonable
> > (within 5% variation) vcpu steal time since guest vcpus cede the
> > processor on entering the idle task. To do this, we introduce a new
> > field ceded_st in kvm_vcpu_arch structure to accurately track the guest
> > vcpu cede status (this is needed since the existing ceded field is
> > modified before we can use it). During DTL entry creation, we check this
> > flag and account the time as stolen if the guest vcpu had not ceded.
> 
> I think this is more or less a question about the semantic:
> 
> What would happen if you use  current->sched_info.run_delay like x86 also
> on power? How far are the numbers away?

The numbers were quite off and didn't quite make sense.

> My feeling is, that the semantics
> of "steal time" inside the guest is somewhat different on each platform. 
> 
> This brings me to a 2nd question:
> Do you need to match the host view of guest steal time with the guest view
> or do we want to have a host view that translates as "this is the time that
> the guest was runnable but we were too busy to schedule him"?

Very good point. This is probably good enough for our purpose and I'd 
like to think my current patchset does something similar for powerpc. We 
don't report the exact steal time as seen from within the guest, but a 
close approximation of it. We count all time that a vcpu was not-idle as 
steal. This includes time we were doing something in the host on behalf 
of the vcpu as well as time when we were just doing something else. I 
don't know if we can separate these two or if that would be desirable.  
The scheduler statistics don't seem to accurately reflect this on ppc.

> For the former x86 has the best solution, as the host tells the guest its
> understanding of steal - so both match. For the latter we actually try to
> give guest steal a meaning in the host context  - the overload.
> Would /proc/<pid>/schedstat value 2 (time spent waiting on a runqueue)
> meet your requirements from the cover-letter?

This looks to be the same as sched_info.run_delay, which doesn't seem to 
reflect the wait on the runqueue. I will recheck this on ppc tomorrow.

As an aside, do you happen to know if /proc/<pid>/schedstat accurately 
reports the "overload" on s390?


Thanks!
- Naveen

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christian Borntraeger May 7, 2015, 12:04 p.m. UTC | #3
Am 06.05.2015 um 18:42 schrieb Naveen N. Rao:
> On 2015/05/06 02:46PM, Christian Borntraeger wrote:
>> Am 06.05.2015 um 13:56 schrieb Naveen N. Rao:
>>> On powerpc, kvm tracks both the guest steal time as well as the time
>>> when guest was idle and this gets sent in to the guest through DTL. The
>>> guest accounts these entries as either steal time or idle time based on
>>> the last running task. Since the true guest idle status is not visible
>>> to the host, we can't accurately expose the guest steal time in the
>>> host.
>>>
>>> However, tracking the guest vcpu cede status can get us a reasonable
>>> (within 5% variation) vcpu steal time since guest vcpus cede the
>>> processor on entering the idle task. To do this, we introduce a new
>>> field ceded_st in kvm_vcpu_arch structure to accurately track the guest
>>> vcpu cede status (this is needed since the existing ceded field is
>>> modified before we can use it). During DTL entry creation, we check this
>>> flag and account the time as stolen if the guest vcpu had not ceded.
>>
>> I think this is more or less a question about the semantic:
>>
>> What would happen if you use  current->sched_info.run_delay like x86 also
>> on power? How far are the numbers away?
> 
> The numbers were quite off and didn't quite make sense.

Strange. I would expect to match at least the wall clock time between
runnable and running. Maybe its just a bug?


> 
>> My feeling is, that the semantics
>> of "steal time" inside the guest is somewhat different on each platform. 
>>
>> This brings me to a 2nd question:
>> Do you need to match the host view of guest steal time with the guest view
>> or do we want to have a host view that translates as "this is the time that
>> the guest was runnable but we were too busy to schedule him"?
> 
> Very good point. This is probably good enough for our purpose and I'd 
> like to think my current patchset does something similar for powerpc. We 
> don't report the exact steal time as seen from within the guest, but a 
> close approximation of it. We count all time that a vcpu was not-idle as 
> steal. This includes time we were doing something in the host on behalf 
> of the vcpu as well as time when we were just doing something else. I 
> don't know if we can separate these two or if that would be desirable.  
> The scheduler statistics don't seem to accurately reflect this on ppc.
> 
>> For the former x86 has the best solution, as the host tells the guest its
>> understanding of steal - so both match. For the latter we actually try to
>> give guest steal a meaning in the host context  - the overload.
>> Would /proc/<pid>/schedstat value 2 (time spent waiting on a runqueue)
>> meet your requirements from the cover-letter?
> 
> This looks to be the same as sched_info.run_delay, which doesn't seem to 
> reflect the wait on the runqueue. I will recheck this on ppc tomorrow.
> 
> As an aside, do you happen to know if /proc/<pid>/schedstat accurately 
> reports the "overload" on s390?

Things are usually even more complicated as we always have the LPAR hypervisor
below the KVM or z/VM hypervisor (KVM or z/VM guests are always nested so to
speak). Depending on the overcommit on LPAR level the wall clock times might 
indicate a problem in a "wrong" place. 

Now the steal time in a kvm guest is actually precise as the hardware will
step the guest cpu timer only when both LPAR and KVM have this CPU scheduled.
This will also cause "steal" when KVM emulates an instruction for the guest - 
unless we correct the guest view - which we dont right now.
The Linux in LPAR also sees the steal time it got stolen by LPAR.

I really have not looked closely at run_delay. My assumption is that
it boils down to "wall clock time between runnable and running". If the
admin does overcommit in KVM and LPAR is just slightly  overcommitted this
is probably good enough. If the overcommit happens at LPAR then the value
might be confusing. I would assume that people overcommit at the z/VM or KVM
level and the LPAR is managed with less overcommit - but thats not a given.

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index a193a13..48cafd6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -661,6 +661,7 @@  struct kvm_vcpu_arch {
 	u64 busy_preempt;
 
 	u32 emul_inst;
+	u8 ceded_st;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 0034b6b..7c11c84 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -534,6 +534,7 @@  int main(void)
 	DEFINE(VCPU_DEC_EXPIRES, offsetof(struct kvm_vcpu, arch.dec_expires));
 	DEFINE(VCPU_PENDING_EXC, offsetof(struct kvm_vcpu, arch.pending_exceptions));
 	DEFINE(VCPU_CEDED, offsetof(struct kvm_vcpu, arch.ceded));
+	DEFINE(VCPU_CEDED_ST, offsetof(struct kvm_vcpu, arch.ceded_st));
 	DEFINE(VCPU_PRODDED, offsetof(struct kvm_vcpu, arch.prodded));
 	DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr));
 	DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc));
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 48d3c5d..7a7e3ab 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -565,6 +565,8 @@  static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
 	spin_lock_irq(&vcpu->arch.tbacct_lock);
 	stolen += vcpu->arch.busy_stolen;
 	vcpu->arch.busy_stolen = 0;
+	if (!vcpu->arch.ceded_st && stolen)
+		(pid_task(vcpu->pid, PIDTYPE_PID))->gstime += stolen;
 	spin_unlock_irq(&vcpu->arch.tbacct_lock);
 	if (!dt || !vpa)
 		return;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 4d70df2..80efc31 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -924,6 +924,7 @@  deliver_guest_interrupt:
 fast_guest_return:
 	li	r0,0
 	stb	r0,VCPU_CEDED(r4)	/* cancel cede */
+	stb	r0,VCPU_CEDED_ST(r4)	/* cancel cede */
 	mtspr	SPRN_HSRR0,r10
 	mtspr	SPRN_HSRR1,r11
 
@@ -2059,6 +2060,7 @@  _GLOBAL(kvmppc_h_cede)		/* r3 = vcpu pointer, r11 = msr, r13 = paca */
 	std	r11,VCPU_MSR(r3)
 	li	r0,1
 	stb	r0,VCPU_CEDED(r3)
+	stb	r0,VCPU_CEDED_ST(r3)
 	sync			/* order setting ceded vs. testing prodded */
 	lbz	r5,VCPU_PRODDED(r3)
 	cmpwi	r5,0
@@ -2266,6 +2268,7 @@  kvm_cede_prodded:
 	stb	r0,VCPU_PRODDED(r3)
 	sync			/* order testing prodded vs. clearing ceded */
 	stb	r0,VCPU_CEDED(r3)
+	stb	r0,VCPU_CEDED_ST(r3)
 	li	r3,H_SUCCESS
 	blr