diff mbox series

[v2,8/9] KVM: X86: Optimize pte_list_desc with per-array counter

Message ID 20210625153415.43620-1-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series KVM: X86: Some light optimizations on rmap logic | expand

Commit Message

Peter Xu June 25, 2021, 3:34 p.m. UTC
Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
logic.  E.g., we don't need to loop over the array any more for most reasons.

This will make more sense after we've switched the array size to be larger
otherwise the counter will be a waste.

Initially I wanted to store a tail pointer at the head of the array list so we
don't need to traverse the list at least for pushing new ones (if without the
counter we traverse both the list and the array).  However that'll need
slightly more change without a huge lot benefit, e.g., after we grow entry
numbers per array the list traversing is not so expensive.

So let's be simple but still try to get as much benefit as we can with just
these extra few lines of changes (not to mention the code looks easier too
without looping over arrays).

I used the same a test case to fork 500 child and recycle them ("./rmap_fork
500" [1]), this patch further speeds up the total fork time of about 14%, which
is a total of 38% of vanilla kernel:

        Vanilla:      367.20 (+-4.58%)
        3->15 slots:  302.00 (+-5.30%)
        Add counter:  265.20 (+-9.88%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

Comments

Sean Christopherson July 28, 2021, 9:04 p.m. UTC | #1
On Fri, Jun 25, 2021, Peter Xu wrote:
> Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
> logic.  E.g., we don't need to loop over the array any more for most reasons.
> 
> This will make more sense after we've switched the array size to be larger
> otherwise the counter will be a waste.
> 
> Initially I wanted to store a tail pointer at the head of the array list so we
> don't need to traverse the list at least for pushing new ones (if without the
> counter we traverse both the list and the array).  However that'll need
> slightly more change without a huge lot benefit, e.g., after we grow entry
> numbers per array the list traversing is not so expensive.
> 
> So let's be simple but still try to get as much benefit as we can with just
> these extra few lines of changes (not to mention the code looks easier too
> without looping over arrays).
> 
> I used the same a test case to fork 500 child and recycle them ("./rmap_fork
> 500" [1]), this patch further speeds up the total fork time of about 14%, which
> is a total of 38% of vanilla kernel:
> 
>         Vanilla:      367.20 (+-4.58%)
>         3->15 slots:  302.00 (+-5.30%)
>         Add counter:  265.20 (+-9.88%)
> 
> [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9b093985a2ef..ba0258bdebc4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -138,10 +138,15 @@ module_param(dbg, bool, 0644);
>  #include <trace/events/kvm.h>
>  
>  /* make pte_list_desc fit well in cache lines */
> -#define PTE_LIST_EXT 15
> +#define PTE_LIST_EXT 14

Doh, I looked at kvm/queue code before looking at the full series.

>  struct pte_list_desc {
>  	u64 *sptes[PTE_LIST_EXT];
> +	/*
> +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> +	 */
> +	u64 spte_count;

Per my feedback to the previous patch, this should be above sptes[] so that rmaps
with <8 SPTEs only touch one cache line.  No idea if it actually matters in
practice, but I can't see how it would harm anything.

>  	struct pte_list_desc *more;
>  };
Peter Xu July 28, 2021, 9:51 p.m. UTC | #2
On Wed, Jul 28, 2021 at 09:04:30PM +0000, Sean Christopherson wrote:
> >  struct pte_list_desc {
> >  	u64 *sptes[PTE_LIST_EXT];
> > +	/*
> > +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> > +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> > +	 */
> > +	u64 spte_count;
> 
> Per my feedback to the previous patch, this should be above sptes[] so that rmaps
> with <8 SPTEs only touch one cache line.  No idea if it actually matters in
> practice, but I can't see how it would harm anything.

Reasonable.  Not sure whether this would change the numbers a bit in the commit
message; it can be slightly better but also possible to be non-observable.
Paolo, let me know if you want me to repost/retest with the change (along with
keeping the comment in the other patch).

Thanks for looking!
Paolo Bonzini July 29, 2021, 9:33 a.m. UTC | #3
On 28/07/21 23:51, Peter Xu wrote:
> Reasonable.  Not sure whether this would change the numbers a bit in the commit
> message; it can be slightly better but also possible to be non-observable.
> Paolo, let me know if you want me to repost/retest with the change (along with
> keeping the comment in the other patch).

Yes, feel free please start from the patches in kvm/queue (there were 
some conflicts, so it will save you the rebasing work) and send v3 
according to Sean's callbacks.

Paolo
Peter Xu July 29, 2021, 3:53 p.m. UTC | #4
On Thu, Jul 29, 2021 at 11:33:32AM +0200, Paolo Bonzini wrote:
> On 28/07/21 23:51, Peter Xu wrote:
> > Reasonable.  Not sure whether this would change the numbers a bit in the commit
> > message; it can be slightly better but also possible to be non-observable.
> > Paolo, let me know if you want me to repost/retest with the change (along with
> > keeping the comment in the other patch).
> 
> Yes, feel free please start from the patches in kvm/queue (there were some
> conflicts, so it will save you the rebasing work) and send v3 according to
> Sean's callbacks.

Will do.  Thanks,
Peter Xu July 30, 2021, 3:45 p.m. UTC | #5
On Wed, Jul 28, 2021 at 09:04:30PM +0000, Sean Christopherson wrote:
> >  struct pte_list_desc {
> >  	u64 *sptes[PTE_LIST_EXT];
> > +	/*
> > +	 * Stores number of entries stored in the pte_list_desc.  No need to be
> > +	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
> > +	 */
> > +	u64 spte_count;
> 
> Per my feedback to the previous patch, this should be above sptes[] so that rmaps
> with <8 SPTEs only touch one cache line.  No idea if it actually matters in
> practice, but I can't see how it would harm anything.

Since at it, I'll further move "more" to be at the entry too, so I think it
optimizes full entries case too.

/*
 * Slight optimization of cacheline layout, by putting `more' and `spte_count'
 * at the start; then accessing it will only use one single cacheline for
 * either full (entries==PTE_LIST_EXT) case or entries<=6.
 */
struct pte_list_desc {
	struct pte_list_desc *more;
	/*
	 * Stores number of entries stored in the pte_list_desc.  No need to be
	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
	 */
	u64 spte_count;
	u64 *sptes[PTE_LIST_EXT];
};

Thanks,
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9b093985a2ef..ba0258bdebc4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -138,10 +138,15 @@  module_param(dbg, bool, 0644);
 #include <trace/events/kvm.h>
 
 /* make pte_list_desc fit well in cache lines */
-#define PTE_LIST_EXT 15
+#define PTE_LIST_EXT 14
 
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
+	/*
+	 * Stores number of entries stored in the pte_list_desc.  No need to be
+	 * u64 but just for easier alignment.  When PTE_LIST_EXT, means full.
+	 */
+	u64 spte_count;
 	struct pte_list_desc *more;
 };
 
@@ -893,7 +898,7 @@  static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	int i, count = 0;
+	int count = 0;
 
 	if (!rmap_head->val) {
 		rmap_printk("%p %llx 0->1\n", spte, *spte);
@@ -903,24 +908,24 @@  static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc = mmu_alloc_pte_list_desc(vcpu);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
+		desc->spte_count = 2;
 		rmap_head->val = (unsigned long)desc | 1;
 		++count;
 	} else {
 		rmap_printk("%p %llx many->many\n", spte, *spte);
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-		while (desc->sptes[PTE_LIST_EXT-1]) {
+		while (desc->spte_count == PTE_LIST_EXT) {
 			count += PTE_LIST_EXT;
-
 			if (!desc->more) {
 				desc->more = mmu_alloc_pte_list_desc(vcpu);
 				desc = desc->more;
+				desc->spte_count = 0;
 				break;
 			}
 			desc = desc->more;
 		}
-		for (i = 0; desc->sptes[i]; ++i)
-			++count;
-		desc->sptes[i] = spte;
+		count += desc->spte_count;
+		desc->sptes[desc->spte_count++] = spte;
 	}
 	return count;
 }
@@ -930,13 +935,12 @@  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
 			   struct pte_list_desc *desc, int i,
 			   struct pte_list_desc *prev_desc)
 {
-	int j;
+	int j = desc->spte_count - 1;
 
-	for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
-		;
 	desc->sptes[i] = desc->sptes[j];
 	desc->sptes[j] = NULL;
-	if (j != 0)
+	desc->spte_count--;
+	if (desc->spte_count)
 		return;
 	if (!prev_desc && !desc->more)
 		rmap_head->val = 0;
@@ -969,7 +973,7 @@  static void __pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head)
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 		prev_desc = NULL;
 		while (desc) {
-			for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
+			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
 					pte_list_desc_remove_entry(rmap_head,
 							desc, i, prev_desc);
@@ -993,7 +997,7 @@  static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
-	unsigned int i, count = 0;
+	unsigned int count = 0;
 
 	if (!rmap_head->val)
 		return 0;
@@ -1003,8 +1007,7 @@  unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 
 	while (desc) {
-		for (i = 0; (i < PTE_LIST_EXT) && desc->sptes[i]; i++)
-			count++;
+		count += desc->spte_count;
 		desc = desc->more;
 	}