Message ID | 20170724210306.18428-1-bobby.prani@gmail.com (mailing list archive) |
---|---
State | New, archived |
On 24/07/2017 23:03, Pranith Kumar wrote:
> This patch increases the number of entries we allow in the TLB. I went
> over a few architectures to see if increasing it is problematic. Only
> armv6 seems to have a limitation that only 8 bits can be used for
> indexing these entries. For other architectures, I increased the
> number of TLB entries to a 4K-entry cache.
>
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>

How did you benchmark this, and can you plot (at least for x86 hosts)
the results as CPU_TLB_BITS_MAX grows from 8 to 12?

Thanks,

Paolo
On 07/24/2017 02:03 PM, Pranith Kumar wrote:
> +#ifndef CPU_TLB_BITS_MAX
> +# define CPU_TLB_BITS_MAX 8

You should simply require each backend to define this.

> +++ b/tcg/i386/tcg-target.h
> @@ -162,6 +162,8 @@ extern bool have_popcnt;
>  # define TCG_AREG0 TCG_REG_EBP
>  #endif
>
> +#define CPU_TLB_BITS_MAX 12

This is probably too much. Exemplars:

  NB_MMU_MODES = 1   moxie
  NB_MMU_MODES = 2   m68k
  NB_MMU_MODES = 3   alpha
  NB_MMU_MODES = 7   arm
  NB_MMU_MODES = 8   ppc64

sizeof(CPUArchState):

  tlb bits \ modes        1        2        3        7        8
         8            13856    25840    38952    92024   182576
        12           198176   394480   591912  1382264  1657136

Having 1.5MB of TLB data seems excessive. Please let's get some
performance numbers for various tlb bit sizes. How much improvement
do you get if you increase the size of the victim tlb cache?

r~
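[Editor's note: the jump in Richard's table can be approximated with a quick calculation. Below is a minimal sketch (not QEMU code), assuming a 32-byte CPUTLBEntry (CPU_TLB_ENTRY_BITS = 5) and a 16-byte CPUIOTLBEntry; it counts only the tlb_table and iotlb arrays, so it undershoots the full sizeof(CPUArchState) figures above.]

```c
/*
 * Back-of-the-envelope estimate of the per-vCPU TLB data size.
 * Assumes 32-byte CPUTLBEntry and 16-byte CPUIOTLBEntry; exact
 * sizes vary by host and target.
 */
#include <stdio.h>

static size_t tlb_data_bytes(int modes, int tlb_bits)
{
    size_t entries = (size_t)1 << tlb_bits;
    /* tlb_table entries plus the matching iotlb entries */
    return (size_t)modes * entries * (32 + 16);
}

int main(void)
{
    /* ppc64-like guest: NB_MMU_MODES = 8 */
    printf("8 tlb bits:  %zu bytes\n", tlb_data_bytes(8, 8));   /* ~96 KiB  */
    printf("12 tlb bits: %zu bytes\n", tlb_data_bytes(8, 12));  /* ~1.5 MiB */
    return 0;
}
```

At 12 bits and 8 MMU modes the TLB arrays alone come to roughly 1.5 MiB per vCPU, which is the figure Richard objects to.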
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 24/07/2017 23:03, Pranith Kumar wrote:
>> This patch increases the number of entries we allow in the TLB. I went
>> over a few architectures to see if increasing it is problematic. Only
>> armv6 seems to have a limitation that only 8 bits can be used for
>> indexing these entries. For other architectures, I increased the
>> number of TLB entries to a 4K-entry cache.
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>
> How did you benchmark this, and can you plot (at least for x86 hosts)
> the results as CPU_TLB_BITS_MAX grows from 8 to 12?

Pranith has some numbers, but what we were seeing is the re-fill path
creeping up the perf profiles. Because it is so expensive to re-compute
the entries, pushing up the TLB size does ameliorate the problem.

That said, I don't think increasing the TLB size is our only solution.
What I've asked for is some idea of the pattern of evictions from the
TLB and of the performance of the victim cache. It may be that tweaking
the locality of that cache would be enough.

One idea I had: with an 8-bit TLB you could afford to have 256
dynamically grown arrays in the victim path, one per entry. Then at
flush time you could simply count up the number of victims in the array
for each slot. That would give you a good idea of whether some regions
are hotter than others.

--
Alex Bennée
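[Editor's note: a hypothetical sketch of Alex's per-slot accounting; every name here (vtlb_evictions, tlb_slot_evict, tlb_slot_report) is invented for illustration and does not exist in QEMU. A full version would store the evicted entries themselves in growable per-slot arrays; the counts alone are enough to see whether some slots run hot.]

```c
#include <stdio.h>

#define TLB_SLOTS 256               /* 8-bit TLB: one bucket per slot */

static size_t vtlb_evictions[TLB_SLOTS];

/* Hook this where a main-TLB entry is pushed into the victim path. */
static void tlb_slot_evict(unsigned slot)
{
    vtlb_evictions[slot & (TLB_SLOTS - 1)]++;
}

/* Hook this into the flush path: dump and reset the histogram. */
static void tlb_slot_report(void)
{
    for (unsigned i = 0; i < TLB_SLOTS; i++) {
        if (vtlb_evictions[i]) {
            fprintf(stderr, "slot %3u: %zu evictions\n",
                    i, vtlb_evictions[i]);
        }
        vtlb_evictions[i] = 0;
    }
}

int main(void)
{
    /* Fake a skewed eviction pattern to show the report format. */
    for (int i = 0; i < 1000; i++) {
        tlb_slot_evict(i % 16);     /* sixteen hot slots */
    }
    tlb_slot_report();
    return 0;
}
```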
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 29b3c2ada8..cb81232b83 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -64,6 +64,9 @@ typedef uint64_t target_ulong;
 #define CPU_TLB_ENTRY_BITS 5
 #endif
 
+#ifndef CPU_TLB_BITS_MAX
+# define CPU_TLB_BITS_MAX 8
+#endif
 /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
  * the TLB is not unnecessarily small, but still small enough for the
  * TLB lookup instruction sequence used by the TCG target.
@@ -87,7 +90,7 @@ typedef uint64_t target_ulong;
  * of tlb_table inside env (which is non-trivial but not huge).
  */
 #define CPU_TLB_BITS \
-    MIN(8, \
+    MIN(CPU_TLB_BITS_MAX, \
         TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \
         (NB_MMU_MODES <= 1 ? 0 : \
          NB_MMU_MODES <= 2 ? 1 : \
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 55a46ac825..f428e09c98 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define CPU_TLB_BITS_MAX 12
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 73a15f7e80..35c27a977b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -162,6 +162,8 @@ extern bool have_popcnt;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#define CPU_TLB_BITS_MAX 12
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index d75cb63ed3..fd9046b7ad 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -37,6 +37,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 957f0c0afe..218be322ad 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define CPU_TLB_BITS_MAX 12
 
 typedef enum TCGReg {
     TCG_REG_R0 = 0,
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index 854a0afd70..9fd59a64f2 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -29,6 +29,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
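[Editor's note: to make the cpu-defs.h hunk concrete, here is a standalone sketch of what the clamped macro evaluates to. It assumes the x86-64 backend's TCG_TARGET_TLB_DISPLACEMENT_BITS of 31 and the tail of the NB_MMU_MODES ternary (the <= 4 and <= 8 arms), which lies beyond the context lines shown in the hunk above.]

```c
/*
 * Worked evaluation of the CPU_TLB_BITS clamp. The displacement-bits
 * value and the ternary tail are assumptions taken from upstream
 * headers, not from the hunk itself.
 */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

#define TCG_TARGET_TLB_DISPLACEMENT_BITS 31  /* x86-64 backend */
#define CPU_TLB_ENTRY_BITS 5
#define NB_MMU_MODES 8                       /* ppc64-like guest */
#define CPU_TLB_BITS_MAX 12

#define CPU_TLB_BITS \
    MIN(CPU_TLB_BITS_MAX, \
        TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \
        (NB_MMU_MODES <= 1 ? 0 : \
         NB_MMU_MODES <= 2 ? 1 : \
         NB_MMU_MODES <= 4 ? 2 : \
         NB_MMU_MODES <= 8 ? 3 : 4))

int main(void)
{
    /* MIN(12, 31 - 5 - 3) == 12, i.e. 4096 entries per MMU mode;
     * with the old MIN(8, ...) this was capped at 256. */
    printf("CPU_TLB_BITS = %d (%d entries per mode)\n",
           CPU_TLB_BITS, 1 << CPU_TLB_BITS);
    return 0;
}
```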
This patch increases the number of entries we allow in the TLB. I went
over a few architectures to see if increasing it is problematic. Only
armv6 seems to have a limitation that only 8 bits can be used for
indexing these entries. For other architectures, I increased the
number of TLB entries to a 4K-entry cache.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 include/exec/cpu-defs.h  | 5 ++++-
 tcg/aarch64/tcg-target.h | 1 +
 tcg/i386/tcg-target.h    | 2 ++
 tcg/mips/tcg-target.h    | 1 +
 tcg/s390/tcg-target.h    | 1 +
 tcg/sparc/tcg-target.h   | 1 +
 6 files changed, 10 insertions(+), 1 deletion(-)