From patchwork Fri Jul 6 09:47:53 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Nikunj A. Dadhania"
X-Patchwork-Id: 1164401
From: Nikunj A Dadhania
To: Marcelo Tosatti
Cc: peterz@infradead.org, mingo@elte.hu, avi@redhat.com,
 raghukt@linux.vnet.ibm.com, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, jeremy@goop.org,
 vatsa@linux.vnet.ibm.com, hpa@zytor.com
Subject: Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others
In-Reply-To: <20120703075535.GA13291@amt.cnet>
References: <20120604050223.4560.2874.stgit@abhimanyu.in.ibm.com>
 <20120604050629.4560.85284.stgit@abhimanyu.in.ibm.com>
 <20120703075535.GA13291@amt.cnet>
User-Agent: Notmuch/0.10.2+70~gf0e0053 (http://notmuchmail.org) Emacs/24.0.95.1 (x86_64-unknown-linux-gnu)
Date: Fri, 06 Jul 2012 15:17:53 +0530
Message-ID: <87hatlxlti.fsf@linux.vnet.ibm.com>
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: kvm@vger.kernel.org

On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> >
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> >
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > * For online vcpus: Wait for them to clear the flag
> >
> > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> >
> > Suggested-by: Peter Zijlstra
> > Signed-off-by: Nikunj A. Dadhania
> >
> > --
> > Pseudo Algo:
> >
> >    Write()
> >    ======
> >
> >        guest_exit()
> >            flush_on_enter[i]=0;
> >            running[i] = 0;
> >
> >        guest_enter()
> >            running[i] = 1;
> >            smp_mb();
> >            if(flush_on_enter[i]) {
> >                tlb_flush()
> >                flush_on_enter[i]=0;
> >            }
> >
> >
> >    Read()
> >    ======
> >
> >        GUEST                                KVM-HV
> >
> >    f->flushcpumask = cpumask - me;
> >
> > again:
> >    for_each_cpu(i, f->flushmask) {
> >
> >        if (!running[i]) {
> >                                             case 1:
> >
> >                                             running[n]=1
> >
> >                                             (cpuN does not see
> >                                             flush_on_enter set,
> >                                             guest later finds it
> >                                             running and sends ipi,
> >                                             we are fine here, need
> >                                             to clear the flag on
> >                                             guest_exit)
> >
> >            flush_on_enter[i] = 1;
> >                                             case2:
> >
> >                                             running[n]=1
> >                                             (cpuN - will see flush
> >                                             on enter and an IPI as
> >                                             well - addressed in patch-4)
> >
> >            if (!running[i])
> >                cpu_clear(f->flushmask);     All is well, vm_enter
> >                                             will do the fixup
> >        }
> >                                             case 3:
> >                                             running[n] = 0;
> >
> >                                             (cpuN went to sleep,
> >                                             we saw it as awake,
> >                                             ipi sent, but wait
> >                                             will break without
> >                                             zero_mask and goto
> >                                             again will take care)
> >
> >    }
> >    send_ipi(f->flushmask)
> >
> >    wait_a_while_for_zero_mask();
> >
> >    if (!zero_mask)
> >        goto again;
>
> Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c
> of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should
> help.
>

Please find below the results (debug patch attached for enabling
registration of kvm_vcpu_state)

I have taken results for 1 and 4 vcpus.
Used the following command for starting the tests:

/usr/libexec/qemu-kvm -smp $i -device testdev,chardev=testlog -chardev file,id=testlog,path=vmexit.out -serial stdio -kernel ./x86/vmexit.flat

Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32
core, 32 online cpus and 4*64GB RAM.

x  base     - unpatched host kernel
+  wo_vs    - patched host kernel, vcpu_state not registered
*  w_vs.txt - patched host kernel and vcpu_state registered

1 vcpu results:
---------------

cpuid
=====
      N         Avg        Stddev
x    10      2135.1       17.8975
+    10        2188       18.3666
*    10      2448.9       43.9910

vmcall
======
      N         Avg        Stddev
x    10      2025.5       38.1641
+    10      2047.5       24.8205
*    10      2306.2       40.3066

mov_from_cr8
============
      N         Avg        Stddev
x    10          12        0.0000
+    10          12        0.0000
*    10          12        0.0000

mov_to_cr8
==========
      N         Avg        Stddev
x    10        19.4        0.5164
+    10        19.1        0.3162
*    10        19.2        0.4216

inl_from_pmtimer
================
      N         Avg        Stddev
x    10     18093.2      462.0543
+    10     16579.7     1448.8892
*    10     18577.7      266.2676

ple-round-robin
===============
      N         Avg        Stddev
x    10        16.1        0.3162
+    10        16.2        0.4216
*    10        15.3        0.4830

4 vcpus result
--------------

cpuid
=====
      N         Avg        Stddev
x    10      2135.8       10.0642
+    10        2165        6.4118
*    10      2423.7       12.5526

vmcall
======
      N         Avg        Stddev
x    10      2028.3       19.6641
+    10      2024.7        7.2273
*    10      2276.1       13.8680

mov_from_cr8
============
      N         Avg        Stddev
x    10          12        0.0000
+    10          12        0.0000
*    10          12        0.0000

mov_to_cr8
==========
      N         Avg        Stddev
x    10          19        0.0000
+    10          19        0.0000
*    10          19        0.0000

inl_from_pmtimer
================
      N         Avg        Stddev
x    10     25574.2     1693.5374
+    10     25190.7     2219.9223
*    10       23044     1230.8737

ipi
===
      N         Avg        Stddev
x    20    31996.75     7290.1777
+    20    33683.25     9795.1601
*    20     34563.5     8338.7826

ple-round-robin
===============
      N         Avg        Stddev
x    10      6281.7     1543.8601
+    10      6149.8     1207.7928
*    10      6433.3     2304.5377

Thanks
Nikunj

Enable and register vcpu_state information to the host

Signed-off-by: Nikunj A. Dadhania

diff --git a/x86/vmexit.c b/x86/vmexit.c
index ad8ab55..a9823c9 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -3,6 +3,7 @@
 #include "smp.h"
 #include "processor.h"
 #include "atomic.h"
+#include "vm.h"
 
 static unsigned int inl(unsigned short port)
 {
@@ -173,10 +174,45 @@ static void enable_nx(void *junk)
 		wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX_MASK);
 }
 
+#define KVM_MSR_ENABLED 1
+#define KVM_FEATURE_VCPU_STATE 7
+#define MSR_KVM_VCPU_STATE 0x4b564d04
+
+struct kvm_vcpu_state {
+	int state;
+	int flush_on_enter;
+	int pad[14];
+};
+
+struct kvm_vcpu_state test[4];
+
+static inline void my_wrmsr(unsigned int msr,
+			    unsigned low, unsigned high)
+{
+	asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high) : "memory");
+}
+#define wrmsrl(msr, val) my_wrmsr(msr, (u32)((u64)(val)), ((u64)(val))>>32)
+
+static void enable_vcpu_state(void *junk)
+{
+	struct kvm_vcpu_state *vs;
+	int me = smp_id();
+
+	if (cpuid(0x80000001).d & (1 << KVM_FEATURE_VCPU_STATE)) {
+		vs = &test[me];
+		memset(vs, 0, sizeof(struct kvm_vcpu_state));
+
+		wrmsrl(MSR_KVM_VCPU_STATE, ((unsigned long)(vs) | KVM_MSR_ENABLED));
+		printf("%d: Done vcpu state %p\n", me, virt_to_phys((void*)vs));
+	}
+}
+
 bool test_wanted(struct test *test, char *wanted[], int nwanted)
 {
 	int i;
 
+	return true;
+
 	if (!nwanted)
 		return true;
 
@@ -192,11 +228,16 @@ int main(int ac, char **av)
 	int i;
 
 	smp_init();
+	setup_vm();
+
 	nr_cpus = cpu_count();
 
 	for (i = cpu_count(); i > 0; i--)
 		on_cpu(i-1, enable_nx, 0);
 
+	for (i = cpu_count(); i > 0; i--)
+		on_cpu(i-1, enable_vcpu_state, 0);
+
 	for (i = 0; i < ARRAY_SIZE(tests); ++i)
 		if (test_wanted(&tests[i], av + 1, ac - 1))
 			do_test(&tests[i]);