Message ID | 20170724210306.18428-1-bobby.prani@gmail.com (mailing list archive) |
---|---
State | New, archived |
On 24/07/2017 23:03, Pranith Kumar wrote:
> This patch increases the number of entries we allow in the TLB. I went
> over a few architectures to see if increasing it is problematic. Only
> armv6 seems to have a limitation that only 8 bits can be used for
> indexing these entries. For other architectures, I increased the
> number of TLB entries to a 4K-entry cache.
>
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>

How did you benchmark this, and can you plot (at least for x86 hosts)
the results as CPU_TLB_BITS_MAX grows from 8 to 12?

Thanks,

Paolo
On 07/24/2017 02:03 PM, Pranith Kumar wrote:
> +#ifndef CPU_TLB_BITS_MAX
> +# define CPU_TLB_BITS_MAX 8

You should simply require each backend to define this.

> +++ b/tcg/i386/tcg-target.h
> @@ -162,6 +162,8 @@ extern bool have_popcnt;
>  # define TCG_AREG0 TCG_REG_EBP
>  #endif
>
> +#define CPU_TLB_BITS_MAX 12

This is probably too much. Exemplars:

  NB_MMU_MODES = 1   moxie
  NB_MMU_MODES = 2   m68k
  NB_MMU_MODES = 3   alpha
  NB_MMU_MODES = 7   arm
  NB_MMU_MODES = 8   ppc64

sizeof(CPUArchState):

  tlb bits \ modes        1        2        3        7        8
         8            13856    25840    38952    92024   182576
        12           198176   394480   591912  1382264  1657136

Having 1.5MB of TLB data seems excessive. Please let's get some
performance numbers for various tlb bit sizes. How much improvement
do you get if you increase the size of the victim tlb cache?

r~
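[Editor's note: the jump in Richard's table can be approximated with a quick calculation. Below is a minimal sketch (not QEMU code), assuming a 32-byte CPUTLBEntry (CPU_TLB_ENTRY_BITS = 5) and a 16-byte CPUIOTLBEntry; it counts only the tlb_table and iotlb arrays, so it undershoots the full sizeof(CPUArchState) figures above.]

```c
/*
 * Back-of-the-envelope estimate of the per-vCPU TLB data size.
 * Assumes 32-byte CPUTLBEntry and 16-byte CPUIOTLBEntry; exact
 * sizes vary by host and target.
 */
#include <stdio.h>

static size_t tlb_data_bytes(int modes, int tlb_bits)
{
    size_t entries = (size_t)1 << tlb_bits;
    /* tlb_table entries plus the matching iotlb entries */
    return (size_t)modes * entries * (32 + 16);
}

int main(void)
{
    /* ppc64-like guest: NB_MMU_MODES = 8 */
    printf("8 tlb bits:  %zu bytes\n", tlb_data_bytes(8, 8));   /* ~96 KiB  */
    printf("12 tlb bits: %zu bytes\n", tlb_data_bytes(8, 12));  /* ~1.5 MiB */
    return 0;
}
```

At 12 bits and 8 MMU modes the TLB arrays alone come to roughly 1.5 MiB per vCPU, which is the figure Richard objects to.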
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 24/07/2017 23:03, Pranith Kumar wrote:
>> This patch increases the number of entries we allow in the TLB. I went
>> over a few architectures to see if increasing it is problematic. Only
>> armv6 seems to have a limitation that only 8 bits can be used for
>> indexing these entries. For other architectures, I increased the
>> number of TLB entries to a 4K-entry cache.
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>
> How did you benchmark this, and can you plot (at least for x86 hosts)
> the results as CPU_TLB_BITS_MAX grows from 8 to 12?

Pranith has some numbers, but what we were seeing is the re-fill path
creeping up the perf profiles. Because it is so expensive to re-compute
the entries, pushing up the TLB size does ameliorate the problem.

That said, I don't think increasing the TLB size is our only solution.
What I've asked for is some idea of the pattern of evictions from the
TLB and of the performance of the victim cache. It may be that tweaking
the locality of that cache would be enough.

One idea I had: with an 8-bit TLB you could afford to have 256
dynamically grown arrays in the victim path, one per entry. Then at
flush time you could simply count up the number of victims in the array
for each slot. That would give you a good idea of whether some regions
are hotter than others.

--
Alex Bennée
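[Editor's note: a hypothetical sketch of Alex's per-slot accounting; every name here (vtlb_evictions, tlb_slot_evict, tlb_slot_report) is invented for illustration and does not exist in QEMU. A full version would store the evicted entries themselves in growable per-slot arrays; the counts alone are enough to see whether some slots run hot.]

```c
#include <stdio.h>

#define TLB_SLOTS 256               /* 8-bit TLB: one bucket per slot */

static size_t vtlb_evictions[TLB_SLOTS];

/* Hook this where a main-TLB entry is pushed into the victim path. */
static void tlb_slot_evict(unsigned slot)
{
    vtlb_evictions[slot & (TLB_SLOTS - 1)]++;
}

/* Hook this into the flush path: dump and reset the histogram. */
static void tlb_slot_report(void)
{
    for (unsigned i = 0; i < TLB_SLOTS; i++) {
        if (vtlb_evictions[i]) {
            fprintf(stderr, "slot %3u: %zu evictions\n",
                    i, vtlb_evictions[i]);
        }
        vtlb_evictions[i] = 0;
    }
}

int main(void)
{
    /* Fake a skewed eviction pattern to show the report format. */
    for (int i = 0; i < 1000; i++) {
        tlb_slot_evict(i % 16);     /* sixteen hot slots */
    }
    tlb_slot_report();
    return 0;
}
```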
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 29b3c2ada8..cb81232b83 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -64,6 +64,9 @@ typedef uint64_t target_ulong;
 #define CPU_TLB_ENTRY_BITS 5
 #endif
 
+#ifndef CPU_TLB_BITS_MAX
+# define CPU_TLB_BITS_MAX 8
+#endif
 /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
  * the TLB is not unnecessarily small, but still small enough for the
  * TLB lookup instruction sequence used by the TCG target.
@@ -87,7 +90,7 @@ typedef uint64_t target_ulong;
  * of tlb_table inside env (which is non-trivial but not huge).
  */
 #define CPU_TLB_BITS \
-    MIN(8, \
+    MIN(CPU_TLB_BITS_MAX, \
         TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \
         (NB_MMU_MODES <= 1 ? 0 : \
          NB_MMU_MODES <= 2 ? 1 : \
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 55a46ac825..f428e09c98 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define CPU_TLB_BITS_MAX 12
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 73a15f7e80..35c27a977b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -162,6 +162,8 @@ extern bool have_popcnt;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#define CPU_TLB_BITS_MAX 12
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index d75cb63ed3..fd9046b7ad 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -37,6 +37,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 957f0c0afe..218be322ad 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define CPU_TLB_BITS_MAX 12
 
 typedef enum TCGReg {
     TCG_REG_R0 = 0,
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index 854a0afd70..9fd59a64f2 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -29,6 +29,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
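[Editor's note: to make the cpu-defs.h hunk concrete, here is a standalone sketch of what the clamped macro evaluates to. It assumes the x86-64 backend's TCG_TARGET_TLB_DISPLACEMENT_BITS of 31 and the tail of the NB_MMU_MODES ternary (the <= 4 and <= 8 arms), which lies beyond the context lines shown in the hunk above.]

```c
/*
 * Worked evaluation of the CPU_TLB_BITS clamp. The displacement-bits
 * value and the ternary tail are assumptions taken from upstream
 * headers, not from the hunk itself.
 */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

#define TCG_TARGET_TLB_DISPLACEMENT_BITS 31  /* x86-64 backend */
#define CPU_TLB_ENTRY_BITS 5
#define NB_MMU_MODES 8                       /* ppc64-like guest */
#define CPU_TLB_BITS_MAX 12

#define CPU_TLB_BITS \
    MIN(CPU_TLB_BITS_MAX, \
        TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \
        (NB_MMU_MODES <= 1 ? 0 : \
         NB_MMU_MODES <= 2 ? 1 : \
         NB_MMU_MODES <= 4 ? 2 : \
         NB_MMU_MODES <= 8 ? 3 : 4))

int main(void)
{
    /* MIN(12, 31 - 5 - 3) == 12, i.e. 4096 entries per MMU mode;
     * with the old MIN(8, ...) this was capped at 256. */
    printf("CPU_TLB_BITS = %d (%d entries per mode)\n",
           CPU_TLB_BITS, 1 << CPU_TLB_BITS);
    return 0;
}
```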
This patch increases the number of entries we allow in the TLB. I went
over a few architectures to see if increasing it is problematic. Only
armv6 seems to have a limitation that only 8 bits can be used for
indexing these entries. For other architectures, I increased the
number of TLB entries to a 4K-entry cache.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 include/exec/cpu-defs.h  | 5 ++++-
 tcg/aarch64/tcg-target.h | 1 +
 tcg/i386/tcg-target.h    | 2 ++
 tcg/mips/tcg-target.h    | 1 +
 tcg/s390/tcg-target.h    | 1 +
 tcg/sparc/tcg-target.h   | 1 +
 6 files changed, 10 insertions(+), 1 deletion(-)