diff mbox series

KVM: VMX: Optimize flushing the PML buffer

Message ID 20210204221959.232582-1-bgardon@google.com (mailing list archive)
State New, archived
Series KVM: VMX: Optimize flushing the PML buffer

Commit Message

Ben Gardon Feb. 4, 2021, 10:19 p.m. UTC
vmx_flush_pml_buffer repeatedly calls kvm_vcpu_mark_page_dirty, which
SRCU-dereferences kvm->memslots on every iteration. In order to give the
compiler more freedom to optimize the function, SRCU-dereference the
pointer kvm->memslots only once.

Reviewed-by: Makarand Sonare <makarandsonare@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>

---

Tested by running the dirty_log_perf_test selftest on a dual socket Intel
Skylake machine:
./dirty_log_perf_test -v 4 -b 30G -i 5

The test was run 5 times with and without this patch and the dirty
memory time for iterations 2-5 was averaged across the 5 runs.
Iteration 1 was discarded for this analysis because it is still dominated
by the time spent populating memory.

The average time for each run demonstrated a strange bimodal distribution,
with clusters around 2 seconds and 2.5 seconds. This may have been a
result of vCPU migration between NUMA nodes.

In any case, the dirty memory times with this patch averaged 2.07
seconds, a 7% savings over the 2.22 second average without this patch.

While these savings may be partly a result of the patched runs having
one more run in the 2 second cluster, the patched runs in the higher
cluster were also 7-8% shorter than those in the unpatched case.

Below is the raw data for anyone interested in visualizing the results
with a graph:
Iteration	Baseline	Patched
2		2.038562907	2.045226614
3		2.037363248	2.045033709
4		2.037176331	1.999783966
5		1.999891981	2.007849104
2		2.569526298	2.001252504
3		2.579110209	2.008541897
4		2.585883731	2.005317983
5		2.588692727	2.007100987
2		2.01191437	2.006953735
3		2.012972236	2.04540153
4		1.968836017	2.005035246
5		1.967915154	2.003859551
2		2.037533296	1.991275846
3		2.501480125	2.391886691
4		2.454382587	2.391904789
5		2.461046772	2.398767963
2		2.036991484	2.011331436
3		2.002954418	2.002635687
4		2.053342717	2.006769959
5		2.522539759	2.006470059
Average		2.223405818	2.069119963

 arch/x86/kvm/vmx/vmx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Peter Xu Feb. 4, 2021, 10:51 p.m. UTC | #1
Hi, Ben,

On Thu, Feb 04, 2021 at 02:19:59PM -0800, Ben Gardon wrote:
> The average time for each run demonstrated a strange bimodal distribution,
> with clusters around 2 seconds and 2.5 seconds. This may have been a
> result of vCPU migration between NUMA nodes.

Have you thought about using numactl or similar technique to verify your idea
(force both vcpu threads binding, and memory allocations)?

From the numbers it already shows improvements indeed, but just curious since
you raised this up. :)

> @@ -5707,13 +5708,18 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
>  	else
>  		pml_idx++;
>  
> +	memslots = kvm_vcpu_memslots(vcpu);
> +
>  	pml_buf = page_address(vmx->pml_pg);
>  	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
> +		struct kvm_memory_slot *memslot;
>  		u64 gpa;
>  
>  		gpa = pml_buf[pml_idx];
>  		WARN_ON(gpa & (PAGE_SIZE - 1));
> -		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> +
> +		memslot = __gfn_to_memslot(memslots, gpa >> PAGE_SHIFT);
> +		mark_page_dirty_in_slot(vcpu->kvm, memslot, gpa >> PAGE_SHIFT);

While at it: make "gpa >> PAGE_SHIFT" a temp var too?

Thanks,
Ben Gardon Feb. 5, 2021, 12:02 a.m. UTC | #2
On Thu, Feb 4, 2021 at 2:51 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Ben,
>
> On Thu, Feb 04, 2021 at 02:19:59PM -0800, Ben Gardon wrote:
> > The average time for each run demonstrated a strange bimodal distribution,
> > with clusters around 2 seconds and 2.5 seconds. This may have been a
> > result of vCPU migration between NUMA nodes.
>
> Have you thought about using numactl or similar technique to verify your idea
> (force both vcpu threads binding, and memory allocations)?
>
> From the numbers it already shows improvements indeed, but just curious since
> you raised this up. :)

Frustratingly, the test machines I have don't have numactl installed,
but I've been meaning to add CPU pinning to the selftests perf tests
anyway, so maybe this is a good reason to do it.

>
> > @@ -5707,13 +5708,18 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
> >       else
> >               pml_idx++;
> >
> > +     memslots = kvm_vcpu_memslots(vcpu);
> > +
> >       pml_buf = page_address(vmx->pml_pg);
> >       for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
> > +             struct kvm_memory_slot *memslot;
> >               u64 gpa;
> >
> >               gpa = pml_buf[pml_idx];
> >               WARN_ON(gpa & (PAGE_SIZE - 1));
> > -             kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> > +
> > +             memslot = __gfn_to_memslot(memslots, gpa >> PAGE_SHIFT);
> > +             mark_page_dirty_in_slot(vcpu->kvm, memslot, gpa >> PAGE_SHIFT);
>
> While at it: make "gpa >> PAGE_SHIFT" a temp var too?

That's a good idea, I'll try it.

>
> Thanks,
>
> --
> Peter Xu
>

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cc60b1fc3ee7..46c54802dfdb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5692,6 +5692,7 @@  static void vmx_destroy_pml_buffer(struct vcpu_vmx *vmx)
 static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct kvm_memslots *memslots;
 	u64 *pml_buf;
 	u16 pml_idx;
 
@@ -5707,13 +5708,18 @@  static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
 	else
 		pml_idx++;
 
+	memslots = kvm_vcpu_memslots(vcpu);
+
 	pml_buf = page_address(vmx->pml_pg);
 	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
+		struct kvm_memory_slot *memslot;
 		u64 gpa;
 
 		gpa = pml_buf[pml_idx];
 		WARN_ON(gpa & (PAGE_SIZE - 1));
-		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
+
+		memslot = __gfn_to_memslot(memslots, gpa >> PAGE_SHIFT);
+		mark_page_dirty_in_slot(vcpu->kvm, memslot, gpa >> PAGE_SHIFT);
 	}
 
 	/* reset PML index */