From patchwork Wed Apr 12 01:17:29 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Emilio Cota
X-Patchwork-Id: 9676389
From: "Emilio G. Cota"
To: qemu-devel@nongnu.org
Date: Tue, 11 Apr 2017 21:17:29 -0400
Message-Id: <1491959850-30756-10-git-send-email-cota@braap.org>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1491959850-30756-1-git-send-email-cota@braap.org>
References: <1491959850-30756-1-git-send-email-cota@braap.org>
Subject: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
Cc: Peter Maydell, Eduardo Habkost, Peter Crosthwaite, Stefan Weil,
 Claudio Fontana, Alexander Graf, alex.bennee@linaro.org,
 qemu-arm@nongnu.org, Pranith Kumar, Paolo Bonzini, Aurelien Jarno,
 Richard Henderson

Speed up indirect branches by adding a helper to look for the
TB in tb_jmp_cache. The helper returns either the corresponding
host address or NULL.

Measurements:

- NBench, x86_64-linux-user.
  Host: Intel i7-4790K @ 4.00GHz
  Y axis: Speedup over 95b31d70

  [ASCII bar chart not recoverable from this copy. Series: jr, jr+inline.
   X axis: one bar group per NBench test, plus hmean.]
  png: http://imgur.com/Jxj4hBd

  The fact that NBench is not very sensitive to changes here is a
  little surprising, especially given the significant improvements for
  ARM shown in the previous commit. I wonder whether the compiler is doing
  a better job compiling the x86_64 version (I'm using gcc 5.4.0), or I'm simply
  missing some i386 instructions to which the jr optimization should
  be applied.

- specINT 2006 (test set), x86_64-linux-user.
  Host: Intel i7-4790K @ 4.00GHz
  Y axis: Speedup over 95b31d70

  [ASCII bar chart not recoverable from this copy. Series: jr+inline.
   X axis: one bar group per SPECint 2006 benchmark, plus hmean.]
  png: http://imgur.com/63Ncmx8

  That is a 4.4% hmean perf improvement.

- specINT 2006 (train set), x86_64-linux-user.
  Host: Intel i7-4790K @ 4.00GHz
  Y axis: Speedup over 95b31d70

  [ASCII bar chart not recoverable from this copy. Series: jr.
   X axis: one bar group per SPECint 2006 benchmark, plus hmean.]
  png: http://imgur.com/hd0BhU6

  That is a 4.39% hmean improvement for jr+inline, i.e. this commit
  (4.5% for noinline). Peak improvement is 20% for xalancbmk.

- specINT 2006 (test set), x86_64-softmmu.
  Host: Intel i7-4790K @ 4.00GHz
  Y axis: Speedup over 95b31d70

  [ASCII bar chart not recoverable from this copy. Series: cross, jr,
   cross+jr. X axis: one bar group per SPECint 2006 benchmark, plus hmean.]
  png: http://imgur.com/IV9UtSa

  Here we see how jr works best when combined with cross -- jr by itself
  is disappointingly around baseline performance. I attribute this to
  the frequent page invalidations and/or TLB flushes (I'm running
  Ubuntu 16.04 as the guest, so there are many processes), which lower
  the maximum attainable hit rate in tb_jmp_cache.
  Overall the greatest hmean improvement comes from cross+jr though.

- specINT 2006 (train set), x86_64-softmmu.
  Host: Intel i7-4790K @ 4.00GHz
  Y axis: Speedup over 95b31d70

  [ASCII bar chart not recoverable from this copy. Series: cross+inline,
   cross+jr+inline. X axis: one bar group per SPECint 2006 benchmark,
   plus hmean.]
  png: http://imgur.com/CBMxrBH

  This is the larger "train" set of SPECint06. Here cross+jr comes
  slightly below cross, but it's within the noise margins (I didn't run
  this many times, since it takes several hours).

Signed-off-by: Emilio G.
Cota
---
 target/i386/helper.h      |  1 +
 target/i386/misc_helper.c | 11 +++++++++++
 target/i386/translate.c   | 42 +++++++++++++++++++++++++++++++++---------
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index dceb343..f7e9f9c 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -2,6 +2,7 @@ DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
 DEF_HELPER_2(cross_page_check, i32, env, tl)
+DEF_HELPER_2(get_hostptr, ptr, env, tl)
 
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
diff --git a/target/i386/misc_helper.c b/target/i386/misc_helper.c
index a41daed..5d50ab0 100644
--- a/target/i386/misc_helper.c
+++ b/target/i386/misc_helper.c
@@ -642,3 +642,14 @@ uint32_t helper_cross_page_check(CPUX86State *env, target_ulong vaddr)
 {
     return !!tb_from_jmp_cache(env, vaddr);
 }
+
+void *helper_get_hostptr(CPUX86State *env, target_ulong vaddr)
+{
+    TranslationBlock *tb;
+
+    tb = tb_from_jmp_cache(env, vaddr);
+    if (unlikely(tb == NULL)) {
+        return NULL;
+    }
+    return tb->tc_ptr;
+}
diff --git a/target/i386/translate.c b/target/i386/translate.c
index ffc8ccc..aab5c13 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -2521,7 +2521,8 @@ static void gen_bnd_jmp(DisasContext *s)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.
    If RECHECK_TF, emit a rechecking helper for #DB, ignoring the state of
    S->TF.  This is used by the syscall/sysret insns.  */
-static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
+static void
+gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
 {
     gen_update_cc_op(s);
 
@@ -2542,6 +2543,22 @@ static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
         tcg_gen_exit_tb(0);
     } else if (s->tf) {
         gen_helper_single_step(cpu_env);
+    } else if (jr) {
+#ifdef TCG_TARGET_HAS_JR
+        TCGLabel *label = gen_new_label();
+        TCGv_ptr ptr = tcg_temp_local_new_ptr();
+        TCGv vaddr = tcg_temp_new();
+
+        tcg_gen_ld_tl(vaddr, cpu_env, offsetof(CPUX86State, segs[R_CS].base));
+        tcg_gen_add_tl(vaddr, vaddr, jr);
+        gen_helper_get_hostptr(ptr, cpu_env, vaddr);
+        tcg_temp_free(vaddr);
+        tcg_gen_brcondi_ptr(TCG_COND_EQ, ptr, NULL, label);
+        tcg_gen_jr(ptr);
+        tcg_temp_free_ptr(ptr);
+        gen_set_label(label);
+#endif
+        tcg_gen_exit_tb(0);
     } else {
         tcg_gen_exit_tb(0);
     }
@@ -2552,13 +2569,18 @@ static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.  */
 static void gen_eob_inhibit_irq(DisasContext *s, bool inhibit)
 {
-    gen_eob_worker(s, inhibit, false);
+    gen_eob_worker(s, inhibit, false, NULL);
 }
 
 /* End of block, resetting the inhibit irq flag.  */
 static void gen_eob(DisasContext *s)
 {
-    gen_eob_worker(s, false, false);
+    gen_eob_worker(s, false, false, NULL);
+}
+
+static void gen_jr(DisasContext *s, TCGv dest)
+{
+    gen_eob_worker(s, false, false, dest);
 }
 
 /* generate a jump to eip. No segment change must happen before as a
@@ -4985,7 +5007,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             gen_push_v(s, cpu_T1);
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 3: /* lcall Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5003,7 +5025,8 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                                       tcg_const_i32(dflag - 1),
                                       tcg_const_i32(s->pc - s->cs_base));
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 4: /* jmp Ev */
             if (dflag == MO_16) {
@@ -5011,7 +5034,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             }
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 5: /* ljmp Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5026,7 +5049,8 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 gen_op_movl_seg_T0_vm(R_CS);
                 gen_op_jmp_v(cpu_T1);
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 6: /* push Ev */
             gen_push_v(s, cpu_T0);
@@ -7143,7 +7167,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         /* TF handling for the syscall insn is different. The TF bit is checked
            after the syscall insn completes. This allows #DB to not be
            generated after one has entered CPL0 if TF is set in FMASK.  */
-        gen_eob_worker(s, false, true);
+        gen_eob_worker(s, false, true, NULL);
         break;
     case 0x107: /* sysret */
         if (!s->pe) {
@@ -7158,7 +7182,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                checked after the sysret insn completes. This allows #DB to be
                generated "as if" the syscall insn in userspace has just
                completed.  */
-            gen_eob_worker(s, false, true);
+            gen_eob_worker(s, false, true, NULL);
         }
         break;
 #endif
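
For readers without the rest of the series at hand, the "host address or
NULL" contract of helper_get_hostptr can be illustrated with a standalone
toy. This is only a sketch under stated assumptions, not QEMU code: ToyTB,
ToyCPU, toy_hash, toy_cache_insert and toy_get_hostptr are hypothetical
names, the cache is assumed to be a plain direct-mapped array indexed by the
low bits of the guest address, and the cs_base/flags checks and any
synchronization that a real lookup would need are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are NOT QEMU's definitions. */
#define TOY_JMP_CACHE_BITS 12
#define TOY_JMP_CACHE_SIZE (1u << TOY_JMP_CACHE_BITS)

typedef struct ToyTB {
    uint64_t pc;    /* guest virtual address the block was translated from */
    void *tc_ptr;   /* host address of the translated code */
} ToyTB;

typedef struct ToyCPU {
    ToyTB *jmp_cache[TOY_JMP_CACHE_SIZE];   /* direct-mapped, one TB per slot */
} ToyCPU;

/* Assumed hash: low bits of the guest address select the slot. */
static unsigned toy_hash(uint64_t pc)
{
    return (unsigned)(pc & (TOY_JMP_CACHE_SIZE - 1));
}

/* Install a TB in its slot, evicting whatever was there before. */
static void toy_cache_insert(ToyCPU *cpu, ToyTB *tb)
{
    assert(tb != NULL);
    cpu->jmp_cache[toy_hash(tb->pc)] = tb;
}

/*
 * Mirrors the shape of the helper in the patch: return the host code
 * pointer on a hit, NULL on a miss. A slot that holds a TB for a
 * different guest address counts as a miss.
 */
static void *toy_get_hostptr(const ToyCPU *cpu, uint64_t pc)
{
    ToyTB *tb = cpu->jmp_cache[toy_hash(pc)];

    if (tb == NULL || tb->pc != pc) {
        return NULL;
    }
    return tb->tc_ptr;
}
```

In terms of the generated code above, a NULL result corresponds to the
branch over tcg_gen_jr that ends in tcg_gen_exit_tb(0), while a non-NULL
result is the pointer that tcg_gen_jr would jump to.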