[09/10] target/i386: optimize indirect branches with TCG's jr op

Speed up indirect branches by adding a helper to look for the
TB in tb_jmp_cache. The helper returns either the corresponding
host address or NULL.

Measurements:

-             NBench, x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.1x+-+-------------------------------------------------------------+-+
         |          jr             $$                                      |
    1.08x+-+......  jr+inline      %%  ..................................+-+
         |                                                                 |
         | $$$                                                             |
    1.06x+-$.$............................%%%............................+-+
         | $ $%%                          % %                              |
    1.04x+-$.$.%..........................%.%............................+-+
         | $ $ %                        $$$ %                  $$$         |
         | $ $ %              %%%       $ $ %                  $ $%%       |
    1.02x+-$.$.%.........%%%.$$.%.......$.$.%...%%%...%%.......$.$.%.$$$%%-+
         | $ $ %         % % $$ % $$$   $ $ % $$$ %   %% $$$%% $ $ % $ $ % |
       1x+-$.$B%R$$$ARGRA%H%T$$P%j$+$%%i$e$.%.$.$.%.$$$%.$.$.%.$.$.%.$.$.%-+
         | $ $ % $ $%% $$$ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
    0.98x+-$.$.%.$.$.%.$.$.%.$$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%.$.$.%.$.$.%-+
         | $ $ % $ $ % $ $ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
         | $ $ % $ $ % $ $ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
    0.96x+-$.$.%.$.$.%.$.$.%.$$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%.$.$.%.$.$.%-+
         +-$$$%%-$$$%%-$$$%%-$$%%-$$$%%-$$$%%-$$$%%-$$$%-$$$%%-$$$%%-$$$%%-+
        ASSIGNMBITFIELFOFP_EMULATHUFFMANLU_DECOMPNEURNUMERICSTRING_SOhmean
  png: http://imgur.com/Jxj4hBd

The fact that NBench is not very sensitive to changes here is a
little surprising, especially given the significant improvements for
ARM shown in the previous commit. I wonder whether the compiler is doing
a better job compiling the x86_64 version (I'm using gcc 5.4.0), or I'm simply
missing some i386 instructions to which the jr optimization should
be applied.

     specINT 2006 (test set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |          jr+inline $$                                           |
    1.25x+-+.............................................................+-+
         |                                                                 |
     1.2x+-+.............................................................+-+
         |                                                                 |
         |                     +++                 +++                     |
    1.15x+-+...................$$$.................$$$...................+-+
         |                     $ $                 $:$                     |
     1.1x+-+...................$.$.................$.$...........$$$$....+-+
         |           +++       $ $                 $ $       +++ $++$      |
    1.05x+-+.........$$$$......$.$.................$.$...........$..$....+-+
         |           $  $      $ $  $$$            $ $ $$$$ $$$$ $  $ $$$$ |
         | $$$$  +++ $  $ +++  $ $  $ $  +++  $$$  $ $ $  $ $++$ $  $ $  $ |
       1x+-$BA$G$$$$_$EM$_$$$$.$.$..$.$..$$$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
    0.95x+-$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
     0.9x+-$$$$-$$$$-$$$$-$$$$-$$$--$$$--$$$--$$$--$$$-$$$$-$$$$-$$$$-$$$$-+
           astarbzip2gcc gobmh264rehmlibquantumcfomneperlbensjxalancbhmean
  png: http://imgur.com/63Ncmx8

That is a 4.4% hmean perf improvement.

-  specINT 2006 (train set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

    1.4x+-+--------------------------------------------------------------+-+
        |        jr  $$                                                    |
        |                                                                  |
    1.3x+-+..............................................................+-+
        |                                                                  |
        |                                                                  |
    1.2x+-+......................................................$$$$....+-+
        |                      +++                     $$$$  :   $++$      |
        |                     $$$$                $$$$ $  $  :   $  $      |
    1.1x+-+...................$..$................$..$.$..$.$$$$.$..$....+-+
        |                     $  $                $  $ $  $ $: $ $  $ +++  |
        |  +++       +++  +++ $  $ $$$$  +++      $  $ $  $ $: $ $  $ $$$$ |
      1x+-$$$$GRAPH_$$$$_$$$$.$..$.$..$.$$$$......$..$.$..$.$..$.$..$.$..$-+
        | $++$ $$$$ $  $ $++$ $  $ $  $ $  $      $  $ $  $ $  $ $  $ $  $ |
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $      $  $ $  $ $  $ $  $ $  $ |
    0.9x+-$..$.$..$.$..$.$..$.$..$.$..$.$..$......$..$.$..$.$..$.$..$.$..$-+
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $ $$$$ $  $ $  $ $  $ $  $ $  $ |
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ |
    0.8x+-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-+
          astarbzip2 gcc gobmh264rehmlibquantmcfomneperlbensjexalancbhmean
  png: http://imgur.com/hd0BhU6

That is, a 4.39 % hmean improvement for jr+inline, i.e. this commit.
(4.5% for noinline). Peak improvement is 20% for xalancbmk.

-    specINT 2006 (test set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |         cross    $$                                             |
    1.25x+-+.....  jr       %%  .........................................+-+
         |         cross+jr @@                                      :      |
     1.2x+-+.............................................................+-+
         |                                                       :  :      |
         |             +++                                       :  :      |
    1.15x+-+...........@@................................................+-+
         |           $$@@ $$++                    +++            : @@      |
     1.1x+-+.........$$@@.$$@@.....................................@@....+-+
         |           $$@@ $$@@                    $$ :   @@@  +++$$@@      |
    1.05x+-+.........$$@@.$$@@...@@...............$$...$$@.@.....$$@@....+-+
         |        +++$$%@ $$@@  %%@+++++++++++++++$$+: $$@ @++@@ $$%@+$$@@+|
         |  +@@+++@@+$$%@ $$@@++%%@$$$%  ::@@ ::@@$$@@@$$% @$$@@ $$%@+$$@@ |
       1x+-$$%@A$$%@R$$%@R$$%@$$$%@$_$%@s%%%@$$%%@$$@.@$$%.@$$@@.$$%@.$$%@-+
         |+$$%@ $$%@ $$%@ $$%@$ $%@$+$%@ %+%@$$+%@$$@+@$$% @$$@@ $$%@+$$%@ |
    0.95x+-$$%@.$$%@.$$%@.$$%@$.$%@$.$%@$$.%@$$.%@$$@.@$$%.@$$%@.$$%@.$$%@-+
         | $$%@ $$%@ $$%@ $$%@$ $%@$ $%@$$ %@$$ %@$$%+@$$% @$$%@ $$%@ $$%@ |
     0.9x+-$$%@-$$%@-$$%@-$$%@$$$%@$$$%@$$%%@$$%%@$$%@@$$%@@$$%@-$$%@-$$%@-+
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/IV9UtSa

Here we see how jr works best when combined with cross -- jr by itself is
disappointingly around baseline performance. I attribute this to the frequent
page invalidations and/or TLB flushes (I'm running Ubuntu 16.04 as the guest,
so there are many processes), which lowers the maximum attainable hit rate in
tb_jmp_cache.

Overall the greatest hmean improvement comes from cross+jr though.

-      specINT 2006 (train set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

    1.25x+-+-------------------------------------------------------------+-+
         |        cross+inline    $$                                       |
         |        cross+jr+inline %%                     +++      +++      |
     1.2x+-+.............................................................+-+
         |                                         :      :      +++       |
    1.15x+-+.......................................................%%....+-+
         |            ::   +++                    $$$  $$$%      $$$%      |
         |           $$%%++%%%                    $:$  $+$% +++  $:$%      |
     1.1x+-+.........$$.%.$$.%....................$.$..$.$%......$.$%....+-+
         |      +++  $$+%+$$ %+++++ :+++          $ $: $ $%  :%% $+$% +++  |
    1.05x+-+....$$...$$.%.$$.%......$$............$.$%.$.$%.$$$%.$.$%.$$%%-+
         |      $$%% $$ % $$ % $$%% $$:   +++     $ $% $ $% $:$% $ $% $$+% |
         |      $$+% $$ % $$ % $$:%+$$%%+++: +++  $ $%+$ $% $:$% $ $% $$ % |
       1x+-$$$AR$$A%G$$P%_$$M%_$$o%s$$r%$$$%%e....$.$%.$.$%.$.$%.$.$%.$$.%-+
         | $+$% $$ % $$ % $$ %+$$+% $$:%$:$+%$$$++$ $% $ $% $ $% $ $% $$ % |
    0.95x+-$.$%.$$.%.$$.%.$$.%.$$.%.$$.%$.$.%$.$..$.$%.$.$%.$.$%.$.$%.$$.%-+
         | $ $% $$ % $$ % $$ % $$ % $$ %$ $ %$+$% $ $% $ $% $ $% $ $% $$ % |
         | $ $% $$ % $$ % $$ % $$ % $$ %$ $ %$ $% $ $% $ $% $ $% $ $% $$ % |
     0.9x+-$$$%-$$%%-$$%%-$$%%-$$%%-$$%%$$$%%$$$%-$$$%-$$$%-$$$%-$$$%-$$%%-+
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/CBMxrBH

This is the larger "train" set of SPECint06. Here cross+jr comes slightly
below cross, but it's within the noise margins (I didn't run this many
times, since it takes several hours).

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/helper.h      |  1 +
 target/i386/misc_helper.c | 11 +++++++++++
 target/i386/translate.c   | 42 +++++++++++++++++++++++++++++++++---------
 3 files changed, 45 insertions(+), 9 deletions(-)

[09/10] target/i386: optimize indirect branches with TCG's jr op

Commit Message

Comments

Patch